PDD – Profiler Driven Development

There are many extreme software development strategies out there. PDD can also mean Panic Driven or Performance Driven Development. The first one should be avoided; the second is an MSR paper dealing with performance modelling techniques. Everyone writes tests for their software because they have proven to be useful, but why are so few people profiling their own code? Is it too difficult?

I have created a C# console project named FileWriter (https://github.com/Alois-xx/FileWriter) which writes 5000 100 KB files (= 500 MB) to a folder with different strategies. How do you decide which strategy is the most efficient and/or fastest one? The strategies are

  • Single Threaded
  • Multi Threaded with TPL Tasks
  • Multi Threaded with Parallel.For
    • 2,3,4,5 Threads

The serial version is straightforward. It creates n files and writes a string into each file until the file size is reached. The code is located at https://github.com/Alois-xx/FileWriter/blob/master/DataGenerator.cs, but the exact details are not important.
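For reference, the serial strategy boils down to something like the following sketch (simplified and illustrative; file names, the CreateSerial name and the size bookkeeping are my own, the real code lives in DataGenerator.cs):

using System;
using System.IO;

static void CreateSerial(string folder, int fileCount = 5000, int fileSizeBytes = 100 * 1024)
{
    Directory.CreateDirectory(folder);
    for (int i = 0; i < fileCount; i++)
    {
        using var writer = new StreamWriter(Path.Combine(folder, $"File_{i}.txt"));
        long written = 0;
        int lineCount = 0;
        while (written < fileSizeBytes)   // keep appending lines until ~100 KB are written
        {
            string line = $"echo This is line {lineCount++}";
            writer.WriteLine(line);
            written += line.Length + Environment.NewLine.Length;
        }
    }
}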

Let's measure how long it takes:

As expected the serial version is slowest and the parallel versions are faster; the fastest one uses 3 threads. Based on real world measurements we have found that 3 threads is the fastest solution. OK, problem solved. Move on and implement the next feature. That was easy.

But, as you can imagine, there is always a but: are the numbers we have measured reasonable? Let's do some math. We write 500 MB with the parallel version in 11 s. That gives a write speed of 45 MB/s. My mental performance model is that we should be able to max out a SATA SSD, which can write 430 MB/s.

This is off by nearly a factor of 10 from what the SSD is able to deliver. Oops. Where did we lose performance? We can try the Visual Studio CPU Profiler:

That shows that creating and writing to the file are the most expensive parts. But why? Drilling deeper will help to some extent, but we do not see where we waited for disk writes, because the CPU Profiler only shows CPU consumption. We need another tool. Luckily there is the Concurrency Visualizer extension, free on the Visual Studio Marketplace, which can also show thread blocking times.

After installation you can start your project from the Analyze menu with your current debugging settings.

To keep things simple we start FileWriter with the single threaded strategy, which is the easiest one to analyze:

FileWriter -generate c:\temp serial

In the Utilization tab we get a nice summary of CPU consumption of the machine. In our case other processes were consuming a lot of CPU while we were running. White is free CPU, dark gray is System and light gray are all other processes. Green is our profiled process FileWriter.exe.

The first observation is that once we start writing files other processes become very busy which looks correlated. We come back to that later.

The Threads view shows where the Main Thread is blocked due to IO/Waiting/…

You can click with the mouse on the timeline of any thread to get the corresponding stack trace in the lower view. In this case we are delayed e.g. by 2,309 s by a call to WriteFile. For some reason the .NET 6.0 symbols could not be resolved, but it might also just be a problem with my symbol lookup (there should be a full stack trace visible). Personally I find it difficult to work with this view when other processes are consuming the majority of CPU. But since Concurrency Visualizer uses ETW, there is nothing that stops us from opening the Concurrency Visualizer generated ETW files from the folder C:\Users\\Documents\Visual Studio 2022\ConcurrencyVisualizer\ with WPA.

We see that FileWriter (green) starts off well, but then MsMpEng.exe (red = Defender Antivirus) kicks in and probably delays our File IO significantly. In the middle we see System (purple = Windows OS) doing strange things (Antivirus again?).

Do you remember that our SSD can write 430 MB/s? The Disk Usage view in WPA shows disk writes. Our process FileWriter writes 500 MB of data, but only 4 MB were written in 53 ms (Disk Service Time). Why are we not writing to the SSD? The answer is: the operating system collects a bunch of writes and flushes the data to disk later, when there is time or when the amount of written data exceeds some internal threshold. Based on actual data we need to revise our mental performance model:

We write to the OS write cache instead of the disk. We should be able to write GB/s! The actual disk writes are performed by the OS asynchronously.
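One way to double check that model (an experiment I would try, not something the FileWriter project does) is to open the files with FileOptions.WriteThrough, which forces every write to reach the device before the call returns, so the measured throughput reflects the disk rather than the OS cache:

using System.IO;

// Sketch: write a buffer while bypassing the OS write cache (path and buffer are placeholders).
static void WriteUncached(string path, byte[] buffer)
{
    using var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None,
                                  bufferSize: 64 * 1024, FileOptions.WriteThrough);
    fs.Write(buffer, 0, buffer.Length);
}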

Still we are at 45 MB/s. If the disk is not the problem, then we need to look at where the CPU is spent. Let's turn to the gaps where the green line of FileWriter, writing on one thread to disk, vanishes. There seems to be some roadblock caused by the System process (purple), which is the only active process during that time of meditation:

When we unfold the Checkpoint Volume stacktag, which is part of my Overview Profile of ETWController, we find that whenever NTFS is optimizing the free bitmap of the NTFS volume it traverses a huge linked list in memory, which blocks all file IO operations for the complete disk. Do you see the bluish parts in the graph, which WPA shows wherever NTFS Checkpoint Volume samples are found in the trace? Automatic highlighting of the current table selection in the graph is one of the best features of WPA. The root cause is that FileWriter deletes the old files in the temp folder from a previous run, which produces a lot of stale NTFS entries. When many new files are created, this at some point causes the OS to wreak havoc on file create/delete performance. I have seen dysfunctional machines with a high file churn (many files created/deleted per day) where all file create/delete operations took seconds instead of sub-milliseconds. See a related post about that issue here: https://superuser.com/questions/1630206/fragmented-ntfs-hdd-slow-file-deletion-600ms-per-file. I have hit this issue several times when I cleaned up ETW folders, which contain for each ETL file a large list of NGENPDB folders with the .ni.pdb files. At some point Explorer is stuck for many minutes deleting the last files. The next day, and a defrag later (the NTFS tables still need defragmentation, even for SSDs), things are fast again. With that data we need to revisit our mental performance model again:

Creating/deleting large amounts of files might become very slow depending on internal NTFS state, which lives in in-memory structures and manifests as high CPU consumption in the Kernel inside the NtfsCheckpointVolume method.

That view is already good, but we can do better if we record ETW data on our own and analyze it at scale.

How about a simple batch script that

  • Starts ETW Profiling
  • Executes your application
  • Retrieves the test execution time as return value from your application
  • Stops ETW Profiling
  • Writes an ETW file named with the measured test duration and test case name
  • Extracts ETW data for further analysis

That is what the RunPerformanceTests.cmd script of the FileWriter project does. The tested application is FileWriter.exe, which writes files with different strategies to a folder. When it exits, it reports the duration of the test case in ms as the process exit code, as sketched below. For more information refer to the ProfileTest.cmd command line help.
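The measurement contract is simple enough to sketch in a few lines (illustrative only; RunSelectedStrategy is a hypothetical placeholder for the real strategy dispatch in FileWriter):

using System.Diagnostics;

internal static class Program
{
    private static int Main(string[] args)
    {
        var sw = Stopwatch.StartNew();
        RunSelectedStrategy(args);              // hypothetical dispatch to serial/parallel/taskparallel
        return (int)sw.ElapsedMilliseconds;     // test duration in ms becomes %ERRORLEVEL% for ProfileTest.cmd
    }

    private static void RunSelectedStrategy(string[] args) { /* ... */ }
}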

FileWriter and Measurement Details

Read on if you are eager for details on how the scripts generate ETW measurement data, or skip to the next headline.

RunPerformanceTests.cmd
call ProfileTest.cmd CreateSerial       FileWriter.exe -generate "%OutDir%" serial
call ProfileTest.cmd CreateParallel     FileWriter.exe -generate "%OutDir%" parallel
call ProfileTest.cmd CreateParallel2    FileWriter.exe -generate "%OutDir%" parallel -threads 2
...
call ProfileTest.cmd CreateTaskParallel FileWriter.exe -generate "%OutDir%" taskparallel

The most "complex" part of ProfileTest.cmd is where several ETW profiles of the supplied MultiProfile.wprp are started. The plain start command is

wpr -start MultiProfile.wprp!File -start MultiProfile.wprp!CSwitch -start MultiProfile.wprp!PMCLLC

But in a batch script with the correct escape characters it becomes this:

"!WPRLocation!" -start "!ScriptLocation!MultiProfile.wprp"^^!File -start "!ScriptLocation!MultiProfile.wprp"^^!CSwitch -start "!ScriptLocation!MultiProfile.wprp"^^!PMCLLC

This line starts WPR with File IO tracing, Context Switch tracing (to see thread wait/ready times) and some CPU counters for Last Level Cache misses. Disk and CPU sampling are already part of these profiles.

The PMCLLC profile can cause issues if you are using VMs like HyperV. To profile low level CPU features you need to uninstall all Windows OS HyperV Features to get Last Level Cache CPU tracing running.

You can check if you have access to your CPU counters via

C:\>wpr -pmcsources
Id  Name                        
--------------------------------
  0 Timer                       
  2 TotalIssues                 
  6 BranchInstructions          
 10 CacheMisses                 
 11 BranchMispredictions        
 19 TotalCycles                 
 25 UnhaltedCoreCycles          
 26 InstructionRetired          
 27 UnhaltedReferenceCycles     
 28 LLCReference                
 29 LLCMisses                   
 30 BranchInstructionRetired    
 31 BranchMispredictsRetired    
 32 LbrInserts                  
 33 InstructionsRetiredFixed    
 34 UnhaltedCoreCyclesFixed     
 35 UnhaltedReferenceCyclesFixed
 36 TimerFixed                  

If you get a list with just one entry named Timer then you will not get CPU counters, because you have installed one or several HyperV features. In theory this issue should not exist because this was fixed with Windows 10 RS 5 in 2019 (https://docs.microsoft.com/en-us/archive/blogs/vancem/perfview-hard-core-cpu-investigations-using-cpu-counters-on-windows-10), but I cannot get these CPU counters when HyperV is active in Windows 10 21H2.

The easiest fix is to remove "!ScriptLocation!MultiProfile.wprp"^^!PMCLLC from the ProfileTest.cmd script, or to modify MultiProfile.wprp and set in all HardwareCounter nodes

<HardwareCounter Id="HardwareCounters_EventCounters" Base="" Strict="true">

the value Strict="false" to prevent WPR from complaining when it cannot enable the hardware counters. The MultiProfile.wprp profile uses nearly all features of WPR which are possible according to the not really documented xsd schema, to streamline ETW recording to your needs. The result of this "reverse" documenting is an annotated recording profile that can serve as a base for your own recording profiles. Which profiles are inside it? You can query it from the command line with WPR:

D:\Source\FileWriter\bin\Release\net6.0>wpr -profiles MultiProfile.wprp

Microsoft Windows Performance Recorder Version 10.0.19041 (CoreSystem)
Copyright (c) 2019 Microsoft Corporation. All rights reserved.

   Default                   
               (CPU Samples/Disk/.NET Exceptions/Focus)
   CSwitch    +(CPU Samples/Disk/.NET Exceptions/Focus/Context Switch)
   MiniFilter +(CPU Samples/Disk/.NET Exceptions/Focus/MiniFilter)
   File       +(CPU Samples/Disk/.NET Exceptions/Focus/File IO)
   Network    +(CPU Samples/Disk/.NET Exceptions/Focus/Network)
   Sockets    +(CPU Samples/Disk/.NET Exceptions/Focus/Sockets)
   VirtualAlloc (Long Term)
   UserGDILeaks (Long Term)
   PMCSample  
     PMC Sampling for PMC Rollover + Default
   PMCBranch  
     PMC Cycles per Instruction and Branch data - Counting
   PMCLLC                      
     PMC Cycles per Instruction and LLC data - Counting
   LBR                         
     LBR - Last Branch Record Sampling

Or you can load the profile into WPRUI which is part of the Windows SDK where they will show up under the Custom Measurements node:

All profiles can be mixed (except Long Term) to record common things for unmanaged and .NET code with the least amount of data. A good choice is the Default profile, which enables CPU sampling, some .NET events, Disk and Window Focus events. These settings are based on years of troubleshooting and should be helpful recording settings in your case as well. The profiles supplied by Microsoft record virtually everything, which is often a problem: the trace sizes become huge (many GB) or, if memory recording is used, the history covers one minute or even less on a busy system which has problems.

Generated Data

RunPerformanceTests.cmd saves the ETW data to the C:\Temp folder

After data generation it calls ExtractETW.cmd, which boils down to one call to ETWAnalyzer that loads several files in parallel and creates Json files from the ETW files:

ETWAnalyzer -extract all -fd c:\temp\*Create*.etl -symserver MS -nooverwrite

This will generate JSON files in C:\Temp\Extract which can be queried with ETWAnalyzer without the need to load every ETL file into WPA.

As you can see, the 500+ MB ETL files are reduced to ca. 10 MB, which is a size reduction of a factor of 50 while keeping the most interesting aspects. There are multiple Json files per input file to enable fast loading. If you query e.g. just CPU data with ETWAnalyzer, the other *Derived* files are not even loaded, which keeps your queries fast.

Working With ETWAnalyzer and WPA on Measured Data

To get an overview of whether all executed tests suffer from a known issue, you can query the data from the command line. To get the top CPU consumers in the Kernel, not by method name but by stacktag, you can use this query:

EtwAnalyzer -dump CPU -ProcessName System -StackTags * -TopNMethods 2

I am sure you ask yourself: what is a stacktag? A stacktag is a descriptive name for key methods in a stacktrace of an ETW event. WPA and TraceProcessing can load a user configured xml file which assigns a descriptive (stacktag) name to a stacktrace, or Other if no matching stacktag rule could be found. The following XML fragment defines e.g. the NTFS Checkpoint Volume stacktag used above:

  <Tag Name="NTFS Checkpoint Volume">
	<Entrypoint Module="Ntfs.sys" Method="NtfsCheckpointVolume*"/>
  </Tag>

See default.stacktags of ETWAnalyzer for how it can be used to flag common issues like OS problems, AV scanner activity, encryption overhead, … with descriptive stacktags. If more than one stacktag matches, the matching stacktag deepest in the stacktrace wins. You can override this behavior by setting a priority value for a given tag.

Based on the 7 runs we find that the CheckpointVolume issue is part of every test, blocking file writes for up to 6+ seconds. Other test runs do not suffer from this, which makes comparing the measured numbers questionable. Again we need to update our mental performance model:

Tests which are slower due to CheckpointVolume overhead should be excluded to make measurements comparable where this internal NTFS state delay is not occurring.

The exact number would be the sum of CPU+Wait; because this operation is single threaded we get exact numbers here. ETWAnalyzer can sum method/stacktag CPU/Wait times with the -ShowTotal flag. If we pass Method, the input lines are kept; if we use Process, only process and file totals are shown.

EtwAnalyzer -dump CPU -ProcessName System -StackTags *checkpoint* -ShowTotal method

Now we know that file IO is delayed by up to 7461 ms, without the need to export the data to Excel. ETWAnalyzer allows you to export all data you see at the console to a CSV file. When you add -csv xxx.csv you can aggregate/filter the data further as you need it. This is supported by all -Dump commands!

We can also add methods to the stacktags by adding -methods *checkpoint*, to see where the stacktag got its data from:

EtwAnalyzer -dump CPU -ProcessName System -StackTags *checkpoint* -methods *checkpoint*

The method NtfsCheckpointVolume has 873 ms, but the stacktag has only 857 ms. This is because other stacktags can "steal" CPU sampling/wait data which would otherwise be attributed to it. That is a curse and a feature rolled into one. It is great, because if you add up all stacktags you get the total CPU/Wait time for a given process across all threads. It is bad, because expensive things might look less expensive in one stacktag, because other stacktags have "stolen" some of its CPU/Wait time.

That is all nice, but what is our FileWriter process doing? Let's concentrate on the single threaded use case by filtering the files to *serial* with -FileDir, or -fd for short. You need to use * at the beginning, because the filter clause works on the full path name. This allows you to query a complete directory tree with -recursive, where -fd *Serial*;!*WithAV* -recursive defines a file query starting from the current directory for all Serial files, but excludes folders or files which contain the name WithAV. To test which files match, you can use -dump Testrun -fd xxxx -printfiles.

EtwAnalyzer -dump CPU -ProcessName FileWriter -stacktags * -Filedir *serial*

The test takes 12 s, but it spends 2,2 s alone on Defender activity. The other stacktags do not point to a clear problem we could drill deeper into. One interesting thing is that we spend 750 ms in thread sleeps. A quick check in WPA shows

that these are coming from the .NET runtime, which recompiles busy methods later on the fly. Stacktags are great to flag past issues and categorize activity, but they also tend to decompose expensive operations into several smaller ones. E.g. if you write to a file you might see WriteFile and Defender tags splitting the overall cost of a call to WriteFile. But what stands out is that on my 8 core machine the single threaded FileWriter fully utilizes the machine with Defender activity. We need to update our mental performance model again:

Writing many small files on a single thread will be slowed down a lot by the virus scanner. It also generates a high load on the system which slows down everything else.

Since we wrote the application, we can go down to specific methods to see where the time is spent. To get an overview we use the top level WPA view of our CreateFile method:

In WPA we find that

  • CreateFiles
    • CPU (CPU Usage) 10653 ms
    • Wait Sum 672 ms
    • Ready Time Sum 1354 ms
      • This is the time a thread did wait for a CPU to become free when all CPUs were used

which adds up to 12,679 s and pretty closely (16 ms off) matches the measured 12,695 s. It is always good to verify what you measured. As you have seen before, we have already needed to change our mental performance model quite often. The top CPU consumers in the CreateFiles method are

  • System.IO.StreamWriter::WriteLine
  • System.IO.Strategies.OSFileStreamStrategy::.ctor
  • System.IO.StreamWriter::Dispose
  • System.Runtime.CompilerServices.DefaultInterpolatedStringHandler::ToStringAndClear
  • System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[System.Char]::Rent
  • System.Runtime.CompilerServices.DefaultInterpolatedStringHandler::AppendFormatted
  • System.IO.Strategies.OSFileStreamStrategy::.ctor
  • System.IO.StreamWriter::WriteLine

What is surprising is that ca. 2s of our 12s are spent in creating interpolated strings. The innocent line

 string line = $"echo This is line {lineCount++}";

costs ca. 16% of the overall runtime. Not huge, but still significant. Since we are trying to find out which file writing strategy is fastest this is OK, because the overhead will be the same for all test cases.

We can get the same view as in WPA for our key methods with the following ETWAnalyzer query

EtwAnalyzer -dump CPU -ProcessName FileWriter -fd *serial* -methods *StreamWriter.WriteLine*;*OSFileStreamStrategy*ctor*;*StreamWriter.WriteLine*;*DefaultInterpolatedStringHandler*AppendFormatted*;*TlsOverPerCoreLockedStacksArrayPool*Rent;*DefaultInterpolatedStringHandler.ToStringAndClear;*StreamWriter.Dispose;*filewriter* -fld

You might have noticed that the method names in the list have :: between class and method. ETWAnalyzer always uses ., which is less typing; the :: notation stems from old .NET Framework code, which used :: for JITed code and . for precompiled code.

Additionally I have smuggled -fld into the query. It is the shorthand for -FirstLastDuration, which shows the difference between the last and the first time the method shows up in CPU sampling or context switch data. Because we know that our test only measures DataGenerator.CreateFile calls, we see in the Last-First column 12,695 s, which matches our measured duration down to the ms! This option can be of great value if you want to measure the wall clock time of parallel asynchronous activity. At least for .NET code, the state machine classes generated by the C# compiler contain central methods which are invoked during init and finalization of asynchronous activities, which makes it easy to "measure" the total duration even if you have no additional trace points at hand. You can add -fld s s, -fld local local or -fld utc utc to also show the first and last time in

  • Trace Time (seconds)
  • Local Time (customer time in the time zone the machine was running)
  • UTC Time

See command line help for further options. ETWAnalyzer has an advanced notation of time which can be formatted in the way you need it.

The WPA times for Wait/Ready are the same in ETWAnalyzer, but the CPU times are not. The reason is that to judge the CPU consumption of a method exactly, you need to look in WPA into CPU Usage (Sampled), which shows per method CPU data and is more accurate. ETWAnalyzer merges this automatically for you.

CPU usage (Precise) Trap

The CPU Usage (Precise) view in WPA shows context switch data, which is recorded whenever your application calls a blocking OS function like Read/WriteFile, WaitForSingleObject, … and your thread is switched off the CPU, which is called a context switch. In this graph you will therefore only see (simplification!) methods which were calling a blocking OS method. If you have e.g. a busy for loop in method int Calculate() and the next called method is Console.WriteLine, like this,

int result = Calculate();
Console.WriteLine(result);

to print the result, then you will see in CPU Usage (Precise) all CPU attributed to Console.WriteLine, because that method caused the next blocking OS call. All other non blocking methods called before Console.WriteLine are invisible in this view. To get per method data you need to use the CPU sampling data, which is collected in a different way and measures CPU consumption at method level much better. Is context switch data wrong? No. The per thread CPU consumption is exact, because this is a trace of the OS scheduler when it moves threads on/off a CPU. But you need to be careful not to interpret the method level CPU data as the true CPU consumption of that method.
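Here is a self-contained version of that example to make the attribution effect concrete (loop count and names are arbitrary):

using System;

internal static class PreciseTrap
{
    static int Calculate()
    {
        int sum = 0;
        for (int i = 0; i < 500_000_000; i++)   // pure CPU work, no blocking OS calls
        {
            sum += i;
        }
        return sum;
    }

    static void Main()
    {
        int result = Calculate();
        Console.WriteLine(result);   // first blocking OS call after the busy loop:
                                     // CPU Usage (Precise) charges the burned CPU to this call
    }
}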

CPU Usage (Sampled) Trap

Let's compare both CPU views, (Sampled) vs (Precise):

The StreamWriter.Dispose method consumes 887 ms in the Sampled view, while the Context Switch view shows 1253 ms. In this case the difference is 41%! With regard to CPU consumption you should use the Sampled view for methods to get the most meaningful value.

Every WPA table can be configured to add a Count column. Many people fall into the trap of interpreting the CPU Sampling Count column as the number of method calls, while it actually just counts the number of sampling events. CPU sampling works by stopping all running threads 1000 (default) times per second, taking a full stacktrace and writing that data to ETW buffers. If method A() shows up 1000 times, it gets 1000*1 ms of CPU attributed. CPU sampling has statistical bias and can e.g. miss method C() if it executes just before or after the CPU sampling event takes a stack trace.

ETWAnalyzer Wait/Ready Times

The method list shown by ETWAnalyzer -Dump CPU -methods xxx is a summation across all threads in a process. This serves two purposes

  • Readability
    • If each thread would be printed the result would become unreadable
  • File Size
    • The resulting extracted Json file would become huge

This works well for single threaded workloads, but what about 100 threads waiting for something? WPA sums all threads together if you remove the thread grouping. ETWAnalyzer has chosen a different route: it sums only the non overlapping Wait/Ready times. This means that if more than one thread is waiting, the wait is counted only once.

The Wait/Ready times for multi threaded methods are the sum of all non overlapping Wait/Ready times over all threads. If e.g. a method was blocked on many threads, you will see as maximum wait time your ETW recording time and not some arbitrarily high number which no one can reason about. It is still complicated to judge parallel applications, because with heavy thread oversubscription you will see both Wait and Ready times reach your ETW recording time. This is a clear indication that you have too many threads and you should use fewer of them.
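The idea behind the non overlapping summation can be sketched like this (an illustration of the concept, not ETWAnalyzer's actual implementation). Two threads both waiting from 0 to 10 s therefore contribute 10 s, not 20 s:

using System;
using System.Collections.Generic;
using System.Linq;

static TimeSpan SumNonOverlapping(IEnumerable<(TimeSpan Start, TimeSpan End)> waits)
{
    TimeSpan total = TimeSpan.Zero;
    TimeSpan coveredUpTo = TimeSpan.MinValue;
    foreach (var (start, end) in waits.OrderBy(w => w.Start))
    {
        TimeSpan from = start > coveredUpTo ? start : coveredUpTo;  // skip the part already counted
        if (end > from) total += end - from;                        // count only the new part
        if (end > coveredUpTo) coveredUpTo = end;
    }
    return total;   // can never exceed the recording time
}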

Make Measurements Better

We have found at least two fatal flaws in our test

  • Anti Virus Scanner is consuming all CPU even in single threaded test
    • Add an exclude rule to the folder via PowerShell
      • Add-MpPreference -ExclusionPath “C:\temp\Test”
    • To remove use
      • Remove-MpPreference -ExclusionPath “C:\temp\Test”
  • Some test runs suffer from large delays due to CheckpointVolume calls, which introduce seconds of latency not caused by our application
    • Change the script to delete the temp files before the test is started, to give the OS a chance to get into a well defined file system state

When we do this and repeat the tests a few times, we get for the single threaded case:

Nearly a factor of 3 faster without changing a single line of code is pretty good. The very high numbers come from overloading the system when something else is running in the background. The 12 s numbers are with Defender enabled, while the numbers < 10 s are with a Defender exclusion rule but various NtfsCheckpoint delays. When we align everything correctly we end up with 7 s for the single threaded case, which is a write rate of 71 MB/s. The best value is now 3,1 s for 4 threads with 161 MB/s.

What else can we find? When looking at the variations we also find another influencing factor during file creation:

We have a single threaded test run with 9556 ms and a faster one with 7066 ms. Besides the CheckpointVolume issue we lose another 1,3 s due to reading MFT and paged out data. The faster test case (Trace #2) reads just 0,3 ms from the SSD, while the slower run (Trace #1) reads 1,3 s from disk. The amount of data is small (8 MB), but for some reason the SSD was handing out the data slowly. Yes, SSDs are fast, but sometimes, when they are busy with internal reorganization, they can become slow. SSD performance is a topic for another blog post.

We need to update our mental performance model

File System Cache state differences can lead to wait times due to Hard faults. Check method MiIssueHardFault if the tests have comparable numbers.

Etwanalyzer -dump CPU -pn Filewriter -methods MiIssueHardFault

Those are another 2 s of difference which we can attribute to file system cache state and paged out data.

Measuring CPU Efficiency

We know that 4 threads are fastest after removing random system noise from our measured data. But is it also the most efficient solution? We can dump the CPU's Performance Monitoring Counters (PMC) to the console with

ETWAnalyzer -dump PMC -pn FileWriter

The most fundamental metric is CPI, which is Cycles/Instructions. If you execute the same use case again, you should execute the same number of instructions. When you go for multithreading you need to take locks and coordinate threads, which costs extra instructions. The most energy efficient solution is therefore the single threaded one, although it is slower.
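If you want to sanity check the metric itself, CPI is just the ratio of two counters from the PMCLLC profile (the numbers below are made up for illustration):

// CPI = executed cycles / retired instructions
double totalCycles        = 30_000_000_000;    // TotalCycles counter (made up value)
double instructionRetired = 43_000_000_000;    // InstructionRetired counter (made up value)
double cpi = totalCycles / instructionRetired; // ca. 0,70, in the range of the good runs shown below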

To get an overview you can export the data of the current directory and of the initial measurements we started with into a CSV file. The following query will dump the Performance Monitoring Counters (PMC) to a CSV file. You can use multiple -fd queries to select the data you need:

ETWAnalyzer -dump PMC -pn FileWriter -fd ..\Take_Test1 -fd .  -csv PMC_Summary.csv

The CSV file contains the columns Test Time in ms and CPI, to make it easy to correlate the measured numbers with something else, CPI in this case. This pattern is followed by all -Dump commands of ETWAnalyzer. From this data we can plot test timing vs CPI (Cycles per Instruction) in a Pivot Table:

The initial test where we had

  • Defender
  • CheckpointVolume

issues was not only much slower, but also much worse in terms of CPU efficiency: 1,2 vs 0,69 (smaller is better). This proves that even if Defender were highly efficient on all other cores and did not block our code at all, we would still do much worse (73%), because even when we do exactly the same work, CPU caches and other internal CPU resources slow us down. With this data you can prove why you are slower even when everything else is the same.

Conclusions

If you have read this far, I am impressed. This is a dense article with a lot of information. I hope I have convinced you that profiling your system with ETW, and not only your code, for strategic test cases is an important asset in automated regression testing.

Credits: NASA, ESA, CSA, and STScI (https://www.nasa.gov/image-feature/goddard/2022/nasa-s-webb-delivers-deepest-infrared-image-of-universe-yet)

If you look only at your code you will miss the universe of things that are moving around in the operating system and other processes. You will end up with fluctuating tests which no one can explain. We are doing the same thing 10 times and we still get outliers! The key is not only to understand your own code, but also the system in which it is executing. Both need attention and coordination to achieve the best performance.

To enable that I have written ETWAnalyzer, to help you speed up your own ETW analysis. If you go for ETW you are talking about a lot of data, which needs a fair amount of automation to become manageable. Loading hundreds of tests into WPA is not feasible, but with ETWAnalyzer mass data extraction and analysis has never been easier. The list of supported queries is growing. If you miss something, file an issue so we can take a look at what the community needs.

To come back to the original question: is profiling too difficult? If you are using only UI based tools, then the overhead of doing this frequently is definitely too high. But the scripting approach shown by FileWriter, which records data and extracts it with ETWAnalyzer to make the data queryable, gives you a lot of opportunities. ETWAnalyzer comes with an object model for the extracted Json files. You can query for known issues directly and flag past issues within seconds, without the need to load the profiling data into a graphical viewer to get an overview.

GDI/User Object Leak Tracking – The Easy Way

One of the most complex issues is User/GDI object leaks, because they are nearly invisible in memory dumps. Microsoft does not want us to peek into such "legacy" technology, so no Windbg plugins exist anymore. Today everything is UWP, WPF (ok, that's legacy as well) and mostly some Chrome browser based applications. With .NET 6, WPF and WinForms support has got some love again, so we can expect that new programmers will hit old fashioned User/GDI leaks again. These leaks are worse than "normal" handle leaks, because Windows imposes a limit of 10K GDI and User handles per process. After that you cannot create any more User/GDI handles and you get either exceptions or partially filled windows where content or icons are missing.

Here is an application which leaks Window and GDI Handles on every button click:

    public partial class Leaker : Form
    {
        public Leaker()
        {
            InitializeComponent();
        }

        // Keeps every child form alive, so its window and GDI objects are never released
        List<Leaker> myChilds = new List<Leaker>();

        private void button1_Click(object sender, EventArgs e)
        {
            var leaker = new Leaker();
            var tmp = leaker.Handle;   // touching Handle forces creation of the native window handle
            myChilds.Add(leaker);      // never shown, never disposed -> User/GDI handles pile up
            cCountLabel.Text = $"{myChilds.Count}";
        }
    }
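To watch the handle consumption from within the process itself, the Win32 GetGuiResources API reports the current GDI and User handle counts of a process (a small helper sketch; error handling omitted):

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

internal static class GuiHandles
{
    [DllImport("user32.dll")]
    private static extern uint GetGuiResources(IntPtr hProcess, uint uiFlags);

    // uiFlags: 0 = GR_GDIOBJECTS, 1 = GR_USEROBJECTS
    public static (uint Gdi, uint User) OfCurrentProcess()
    {
        IntPtr self = Process.GetCurrentProcess().Handle;
        return (GetGuiResources(self, 0), GetGuiResources(self, 1));
    }
}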

In Task Manager we find that with every click we leak 6 window handles:

These come from the

  • Leaked Form
  • Create Button
  • Created Label
  • 85 (Number) Label
  • IME Hidden Window
  • MSCTFIME Hidden Window

A very useful tool is Spy++ which is installed when you add C++ core features

via the Visual Studio Installer. Then you will find it in C:\Program Files\Microsoft Visual Studio\<Version>\<Edition>\Common7\Tools\spyxx_amd64.exe or the 32 bit version without the _amd64 postfix.

If you intend to copy Spy++ to a different machine you need these dlls

.\1033\spyxxui_amd64.dll
spyxxhk_amd64.dll
spyxx_amd64.exe

to make it work. The resource dll in the 1033 folder is essential or it won't start.

How would you identify the call stack that led to the allocation of the additional objects? The answer is a small and great ETW tool from Microsoft named Performance HUD.

After starting

you can attach to an already running process

After a few seconds an overlay window pops up

If you double click on it the "real" UI becomes visible.
Switch over to the Handles tab.
Now press the create button in our Leaker app a few times to leak some GDI/User objects.
Press Ctrl+Click on the blue underlined Outstanding text for User Window objects to find the leaking call stack.
Voila, you have found our button1_Click handler as the expected culprit. If you right click on it you can e.g. send a mail with the leaking call stack as an attached .hudinsight file to the developer of the faulty application. Problem solved.

See my earlier post for what else you can do with Performance HUD. It is one of those tools which save a lot of time once you know that they exist.

Another great feature of Performance HUD is analyzing UI thread hangs. It employs some sophisticated analysis to detect that the UI thread is no longer processing window messages, either because some CPU hungry method is executing a long call, or because you are stuck in a blocking OS call. Today everyone is using async/await in C#, so a stuck UI thread should be a thing of the past. But now you can measure and prove that it is not hanging.

Let's add a UI hang via TPL tasks to our sample application, which happens only in Debug builds:

        private void cHang_Click(object sender, EventArgs e)
        {
            Task<int> t = Task.Run(async () => { await Task.Delay(5000); return 1; });
            // t.Result blocks the UI thread for 5 s. Debug.Print is marked [Conditional("DEBUG")],
            // so the blocking call is compiled away in Release builds, which is why the hang shows up only in Debug.
            Debug.Print(t.Result.ToString());
        }

My favorite button is Blame Someone, which unfolds the stack with probably the longest stuck call:

This precisely points towards our stuck button handler, which calls into Task.Result (that frame seems not to be resolved properly, but the later method calls are). The 5 s hang is nicely colored and you can also pack this into an email and send the issue to the maintainer of the corresponding code. I have seen plenty of similar issues in the past, because Thread.Sleep is caught by many FxCop rules, so people tend to use Task.Delay and then block on the result, which is a Thread.Sleep in disguise.
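For completeness, a sketch of the non-blocking variant of the handler above: awaiting the task keeps the UI thread pumping messages, so Performance HUD reports no hang.

        private async void cHang_Click(object sender, EventArgs e)
        {
            int result = await Task.Run(async () => { await Task.Delay(5000); return 1; });
            Debug.Print(result.ToString());
        }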

This works not only with .NET code but with any language, as long as it supports pdbs and walkable stacks. It is great that this diagnostic has been added to Windows, but why did it take so long? As far as I know Performance HUD comes from the Office team, which maintains a "legacy" but important application. Someone dealing with too many GDI/User object leaks was able to commit some lines to the Kernel to instrument the relevant locations with proper ETW events, and we finally arrive at a feature complete leak tracer. With ETW we can track leaks of

  • OS Handles (File, Socket, Pipe, …)
    • wpr -start Handle
    • Analyze with WPA. This records handle data from all processes (system wide)
  • GDI Handles (Cursor, Bitmap, Font, …)
  • User Handles (Window, Menu, Timer, …)
  • VirtualAlloc Memory (Records system wide all processes)
    • wpr -start VirtualAllocation
    • Analyze with WPA

Finally we have covered all frequent leak types with ETW events. No leak can hide anymore.

For memory mapped files there are some ETW events, but unfortunately no analysis for them exists in the public version of WPA. These leaks are still more tricky to figure out than the rest, which are now at least in principle easy. Recording the right data and not too much (ETW files > 6 GB will usually crash WPA) is still challenging, but doable. Performance HUD is very well written. I have successfully processed an 80 GB ETL trace with no hiccups. It can not only attach to already running processes, but it can also analyze previously recorded data.

Now get back to work and fix your leaks and unresponsive UIs!

How Fast is your Virus Scanner?

I am sure everyone has experienced slowdowns due to antivirus solutions, but very few are able to attribute them to the right component on their Windows box. Most people do not even know which AV solution(s) are running on their machine. The end result can be a secure but slow system which is barely usable.

Is your CPU fan always on and your notebook getting hot although you are not running anything? You want to find out what is consuming so much CPU with a command line tool?

Then I have a new tool for you: ETWAnalyzer. It uses the Trace Processing Library which powers WPA (Windows Performance Analyzer). WPA is a great UI to analyze tricky systemic issues, but if you need to check a trace for the n-th time whether the same issue is there again, you want to automate that analysis.

ETWAnalyzer is developed by Siemens Healthineers as its first public open source project. Full Disclosure: I work for Siemens Healthineers. One motivation to make it public was to show how incredibly useful ETW (Event Tracing for Windows) is to track down performance issues in production and development environments. It is one of the best technologies for performance troubleshooting Microsoft has invented, but it is valued by far too few people. ETW is part of Windows and you need no additional tools or code instrumentation to record ETW data! Starting with Windows 10 the Windows Performance Recorder (wpr.exe) is part of the OS. You have a slow system? In that case I would open an Administrator cmd shell and type

C>wpr -start CPU -start DiskIO -start FileIO

Now execute your slow use case and then stop profiling with

C>wpr -stop C:\temp\SlowSystem.etl

This will generate a large file (hundreds of MB up to several GB) which covers roughly the last 60-200 s (depending on system load and installed memory) of what your system was doing. If you have a lot of .NET applications installed, stopping the trace the first time can take a long time (10-60 min), because it will create pdbs for all running .NET applications. Later profiling stop runs are much faster though.

Getting the data is easy. Analyzing is hard when you do not know where you need to look at.

This is where ETWAnalyzer can help out: you ask the recorded data questions about previously analyzed problems. You ask: previously? Yes. Normally you start with WPA to identify a pattern like high CPU in this or that method. Then you (should) create a stacktag to assign a high level description of what a problematic method is about. The stacktags are invaluable to categorize CPU/Wait issues and correlate them with past problems. The supplied stacktag file of ETWAnalyzer contains descriptions for Windows/Network/.NET Framework/antivirus products and even some Chrome stacktags. To get started you can download the latest ETWAnalyzer release from https://github.com/Siemens-Healthineers/ETWAnalyzer/releases, unzip it to your favorite tool directory and add that directory to the PATH environment variable. If you want maximum speed you should use the self contained .NET 6 version.

Now you are ready to extract data from an ETL file with the following command:

ETWAnalyzer -extract all -filedir c:\temp\SlowSystem.etl -symserver MS -allCPU

After extraction all data is contained in several small Json files.

Let's query the data by looking at the top 10 CPU consumers:

ETWAnalyzer -dump CPU -filedir c:\temp\Extract\SlowSystem.json -topN 10

The second most CPU hungry process is MsMpEng.exe which is the Windows Defender scan engine. Interesting but not necessarily a bad sign.

Let's check if any known AV vendor is inducing waits > 10 ms. We use a stacktag query which matches all stacktags with Virus in their name:

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json -stacktags *virus* -MinMaxWaitMs 10

We find a 1349 ms slowdown caused by Defender in a cmd shell. I observed slow process creation after I hit enter to start a freshly downloaded executable from the internet.

Explorer also seems to be slowed down. The numbers shown are the sum over all threads for CPU and wait times. If multiple threads were executing the method, you can see high numbers for the stacktags/methods. For single threaded use cases, such as starting a new process, the numbers correlate pretty well with wall clock time.

The currently stacktagged AV solutions are

  • Applocker
  • Defender
  • Trend Micro
  • Symantec
  • Palo Alto
  • ESET
  • Mc Afee
  • CrowdStrike

Let's dump all slow methods with a wait time between 1300-2000 ms and the timing of all process and file creation calls in explorer.exe and cmd.exe:

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json  -stacktags *virus* -MinMaxWaitMs 1300-2000 -methods *createprocess*;*createfile* -processName explorer;cmd

Antivirus solutions usually hook into CreateProcess and CreateFile calls, which obviously slowed down process creation and file operations in Explorer and process creation in cmd.exe.

Was this the observed slowness? Let's dump the time when the file and process creation calls were visible for the first time in the trace:

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json  -stacktags *virus* -MinMaxWaitMs 1300-2000 -methods *createprocess*;*createfile* -processName explorer;cmd -FirstLastDuration local

That correlates pretty well with the observed slowness. To blame the AV vendor we can dump all methods of device drivers for which we have no symbols. Neither Microsoft nor any other AV vendor delivers symbols for their AV drivers, so by this definition all unresolved symbols of system drivers are most likely AV or hardware drivers. ETWAnalyzer makes this easy with the -methods *.sys -ShowModuleInfo query, which shows all kernel drivers called from our applications for which we do not have symbols:

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json -methods *.sys -processName explorer;cmd -ShowModuleInfo

The yellow output comes from an internal database of ETWAnalyzer which has categorized pretty much all AV drivers. The high wait time comes from WDFilter.sys, which is an antivirus file system driver that is part of Windows Defender.

What have we got?

  • We can record system wide ETW data with WPR or any other ETW recording tool, e.g. https://github.com/Alois-xx/etwcontroller/ which can capture screenshots to correlate the UI with profiling data
  • With ETWAnalyzer we can extract aggregated data for
    • CPU
    • Disk
    • File
    • .NET Exceptions
  • We can identify CPU and wait bottlenecks going from a system wide view down to method level
    • Specific features to estimate AV overhead are present

This is just the beginning. You can check out the documentation for further inspiration how you can make use of system wide profiling data (https://github.com/Siemens-Healthineers/ETWAnalyzer) which can be queried from the command line.

ETWAnalyzer will not supersede WPA for analyzing data. ETWAnalyzer's main strength is analyzing known patterns quickly in large amounts of extracted ETW data. Querying 10 GB of ETW data becomes a few hundred MB of extracted Json data, which has no slow symbol server lookup times anymore.

Happy bug hunting with your new tool! If you find bugs or have ideas how to improve the tool it would be great if you file an issue.

The Case Of A Stuck LoadLibrary Call

I frequently encounter impossible bugs which after full analysis have a reasonable explanation, although I have no source code of the problematic library. Today's fault is featured by

….

suspense

the Intel Math Kernel Library.

Let's use some fancy neural network library (AlgoLibrary!UsingCaffee) like the Berkeley AI caffee.dll, which in turn uses the Intel Math Kernel Library (mkl_core.dll), which hangs forever in a call to LoadLibrary. The first thing to do when you are stuck is to take a memory dump (e.g. use Task Manager or procdump)

and load it into Windbg or Visual Studio. Since you are reading this, I assume you want to know more about the mysterious commands Windbg has to offer. You can load a memory dump into Windbg from the File menu.

To get an overview of where all threads of a process are standing or executing, you prefix a command with ~* to run it on all threads. The dump thread stack command in Windbg is k. To dump all threads you therefore need to execute

~*k

107  Id: 67b0.8094 Suspend: 0 Teb: 00000049`9df88000 Unfrozen
Call Site
ntdll!NtWaitForSingleObject+0x14
ntdll!LdrpDrainWorkQueue+0xe5
ntdll!LdrpLoadDllInternal+0xc2
ntdll!LdrpLoadDll+0xf1
ntdll!LdrLoadDll+0x8c
KERNELBASE!LoadLibraryExW+0x16f
KERNELBASE!LoadLibraryExA+0x31
KERNELBASE!LoadLibraryA+0x39
mkl_core!mkl_serv_mpi_get_ppn+0x48
mkl_core!mkl_serv_mkl_malloc+0x7e7
caffe!caffe::CaffeMallocHost+0x16
caffe!caffe::SyncedMemory::to_cpu+0x116
caffe!caffe::SyncedMemory::mutable_cpu_data+0xe
caffe!caffe::Blob<float>::Reshape+0x175
caffe!caffe::Blob<float>::Reshape+0x138
caffe!caffe::InputLayer<float>::LayerSetUp+0x184
caffe!caffe::Layer<float>::SetUp+0x15
caffe!caffe::Net<float>::Init+0x898
caffe!caffe::Net<float>::Net<float>+0x473
AlgoLibrary!UsingCaffee

When you load a memory dump into Windbg you can issue a full stack walk with the k command, which will hopefully output a nice stack trace after you have resolved your symbols from Microsoft, Intel and the caffee.dll. We see that LoadLibrary is stuck while caffee tries to allocate memory, which internally loads a library. We would expect that our call from AlgoLibrary!UsingCaffee finishes within a few ms. But this one hangs forever.

The allocation code comes from this:

 *ptr = mkl_malloc(size ? size:1, 64);

When LoadLibrary hangs it is usually because some other thread is holding the OS loader lock. This loader lock effectively serializes dll loading and the initialization and release of global/thread static variables. See https://docs.microsoft.com/en-us/windows/win32/dlls/dllmain for a more elaborate description.

With Windbg and the !peb command we can check further

dt ntdll!_PEB 499df0d000
+0x0e8 NumberOfHeaps    : 0x15
+0x0ec MaximumNumberOfHeaps : 0x20
+0x0f8 GdiSharedHandleTable : 0x00000183`71d00000 Void
+0x100 ProcessStarterHelper : (null) 
+0x108 GdiDCAttributeList : 0x14
+0x10c Padding3         : [4]  ""
+0x110 LoaderLock       : 0x00007ffe`f447a568 _RTL_CRITICAL_SECTION
+0x118 OSMajorVersion   : 0xa
+0x11c OSMinorVersion   : 0
+0x120 OSBuildNumber    : 0x3839
+0x122 OSCSDVersion     : 0
+0x124 OSPlatformId     : 2

We can dump the Loader Lock structure by clicking on the blue underlined link in Windbg. But this formatting has not survived in WordPress.

0:107> dx -r1 ((ntdll!_RTL_CRITICAL_SECTION *)0x7ffef447a568)
((ntdll!_RTL_CRITICAL_SECTION *)0x7ffef447a568)                 : 0x7ffef447a568 [Type: _RTL_CRITICAL_SECTION *]
[+0x000] DebugInfo        : 0x7ffef447a998 [Type: _RTL_CRITICAL_SECTION_DEBUG *]
[+0x008] LockCount        : -2 [Type: long]
[+0x00c] RecursionCount   : 1 [Type: long]
[+0x010] OwningThread     : 0x6710 [Type: void *]
[+0x018] LockSemaphore    : 0x0 [Type: void *]
[+0x020] SpinCount        : 0x4000000 [Type: unsigned __int64]

Ahh, now we know that thread 0x6710 owns the lock. Switch to that thread with the command ~~[0x6710]s, which allows you to specify the OS thread id directly. The usual ~<n>s command switches by debugger enumerated thread numbers, which always start at 0. That is convenient, but sometimes you need to switch by OS thread id.

0:114> k

ntdll!NtYieldExecution+0x14
KERNELBASE!SwitchToThread+0x1d
mkl_core!mkl_serv_lock+0x6d
mkl_core!mkl_serv_thread_free_buffers+0x384
mkl_core+0x1016
mkl_core!sdttrsb_4dfl+0x31a
ntdll!LdrpCallInitRoutine+0x4b
ntdll!LdrShutdownThread+0x151
ntdll!RtlExitUserThread+0x3e
kernel32!BaseThreadInitThunk+0x1c
ntdll!RtlUserThreadStart+0x21

That stack trace is strange, because it calls SwitchToThread, which basically tells the OS that if any other runnable thread is ready to run on this core it should run. If not, this is essentially a CPU burning endless loop. Such constructs are normally used in spinlocks, where one at first takes no OS lock at all to avoid the cost of a context switch, at the expense of burning CPU cycles. In this case it looks like mkl_core is cleaning up some thread static variables, which requires synchronization with some lock object that is taken by another thread. Since I do not have the code for this, I can only speculate which thread it is.
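The pattern behind such a loop looks, in managed form, roughly like this (an illustration of the spin-and-yield idea, not Intel's actual code). If the owner never calls Release, Acquire spins forever and burns a full core:

using System.Threading;

internal static class SpinLockSketch
{
    private static int _taken;   // 0 = free, 1 = taken

    public static void Acquire()
    {
        while (Interlocked.CompareExchange(ref _taken, 1, 0) != 0)
        {
            Thread.Yield();   // managed counterpart of SwitchToThread: give the core away and retry
        }
    }

    public static void Release() => Volatile.Write(ref _taken, 0);
}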

Let's get the CPU time of this thread to see how long it has been trying to enter the lock:

0:114> !runaway
 User Mode Time
  Thread       Time
  114:6710     0 days 2:14:28.671
   17:62a4     0 days 0:00:47.656
   92:88bc     0 days 0:00:08.218
   22:107c     0 days 0:00:02.953
   19:3a9c     0 days 0:00:02.765

Nearly all CPU of the process (2 h 15 min) is spent by this thread (2 h 14 min), which has been wasting CPU cycles for hours.

Since we have only one other mkl thread running, which is our thread stuck in the LoadLibrary call, it all makes sense now. The LoadLibrary call hangs because thread 0x6710 is exiting and cleaning up some thread static variables, which requires a lock that our stuck LoadLibrary thread already owns. But since the shutting down thread owns the loader lock, we have two dependent locks owned by two different threads -> deadlock.
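The shape of the deadlock can be reproduced with two ordinary monitors (loader lock and MKL lock replaced by plain objects in this sketch):

using System.Threading;
using System.Threading.Tasks;

object lockLoader = new object();   // stands in for the OS loader lock
object lockMkl    = new object();   // stands in for the internal MKL lock

var loadLibraryThread = Task.Run(() =>
{
    lock (lockMkl)            // mkl_malloc took its internal lock ...
    {
        Thread.Sleep(100);
        lock (lockLoader) { } // ... and now needs the loader lock inside LoadLibrary
    }
});

var exitingThread = Task.Run(() =>
{
    lock (lockLoader)         // the exiting thread already holds the loader lock (DllMain notification) ...
    {
        Thread.Sleep(100);
        lock (lockMkl) { }    // ... and now wants the MKL lock to free its thread buffers
    }
});

Task.WaitAll(loadLibraryThread, exitingThread);   // never returns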

You need to be careful on which threads you execute MKL code when those threads exit at some point in time while new threads are initializing new mkl variables. At some points in time you can see funny deadlocks which look really strange: one thread burns CPU like hell, which is coming from an exiting thread, and your own code is stuck in a LoadLibrary call which waits for the loader lock to become free.

If you dig deep enough you can find the root cause even when you do not have the full source code. As an added bonus I have learned some arcane Windbg commands along the way. These will be helpful the next time you need to dig into some stuck threads.

Concurrent Dictionary Modification Pitfalls

Dictionary<TKey,TValue> is one of the most heavily used collection classes after List<T>. One of the lesser known things is how it behaves when you accidentally modify it concurrently.

What is the output of this simple program under .NET Core 3.1, .NET 5.0 and .NET 4.8?

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

namespace ConcurrentDictionary
{
    class Program
    {
        static Dictionary<string, int> _dict = new();

        static void Modify()
        {
            _dict.Clear();
            for (int i = 0; i < 10; i++)
            {
                string key = $"Key {i}";
                _dict.ContainsKey(key);
                _dict[key] = 1;
            }
        }

        static void Main(string[] args)
        {
            for(int k=0; k < 1_000_000; k++)
            {
                if (k % 100 == 0)
                {
                    Console.WriteLine($"Modify Run {k}");
                }

                Parallel.Invoke(Modify, Modify, Modify);
            }
        }
    }
}

Under .NET Core 3.1, .NET 5.0 it will produce

D:\>ConcurrentDictionary.exe
Modify Run 0
Modify Run 0
Modify Run 100
Unhandled exception. 
 ---> System.InvalidOperationException: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.
   at System.Collections.Generic.Dictionary`2.FindValue(TKey key)
   at ConcurrentDictionary.Program.<>c__DisplayClass0_0.<Main>g__Modify|0() in D:\Source\vc19\ConcurrentDictionary\Program.cs:line 26
   at System.Threading.Tasks.Task.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__277_0(Object obj)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, 

This is similar to what you get when you enumerate a list while it is being modified:

Unhandled exception. 
 ---> System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
   at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
   at System.Collections.Generic.List`1.Enumerator.MoveNext()
   at ConcurrentDictionary.Program.ModifyList() in D:\Source\vc19\ConcurrentDictionary\Program.cs:line 31

When you modify a Dictionary concurrently, or enumerate a List while it is being modified, you get an exception back. That is good and expected.

Let's try the Dictionary sample with .NET 4.8 (or an earlier version):

D:>ConcurrentDictionary.exe
Modify Run 0
Modify Run 100

This time it will stop working and just hang. What's the problem with .NET 4.x? When in doubt, take a memory dump with procdump:

C:\temp>procdump -ma 2216

ProcDump v10.0 - Sysinternals process dump utility
Copyright (C) 2009-2020 Mark Russinovich and Andrew Richards
Sysinternals - www.sysinternals.com

[22:32:25] Dump 1 initiated: C:\temp\ConcurrentDictionary.exe_210810_223225.dmp
[22:32:25] Dump 1 writing: Estimated dump file size is 68 MB.
[22:32:25] Dump 1 complete: 68 MB written in 0.3 seconds
[22:32:25] Dump count reached.

Then open the Dump with Windbg

c:\Debuggers\windbg.exe -z C:\temp\ConcurrentDictionary.exe_210810_223225.dmp

Load sos dll and then dump all managed thread stacks

Microsoft (R) Windows Debugger Version 10.0.17763.132 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\temp\ConcurrentDictionary.exe_210810_223225.dmp]
User Mini Dump File with Full Memory: Only application data is available

Comment: '
*** procdump  -ma 2216 
*** Manual dump'
Symbol search path is: srv*
Executable search path is: 
Windows 10 Version 19043 MP (8 procs) Free x64
Product: WinNt, suite: SingleUserTS
19041.1.amd64fre.vb_release.191206-1406
Machine Name:
Debug session time: Tue Aug 10 22:32:25.000 2021 (UTC + 2:00)
System Uptime: 0 days 0:22:07.133
Process Uptime: 0 days 0:00:45.000
.............................

0:000> .loadby sos clr

0:000> ~*e!ClrStack

OS Thread Id: 0x2370 (0)
Call Site
System.Collections.Generic.Dictionary`2[[System.__Canon, mscorlib],[System.Int32, mscorlib]].Insert(System.__Canon, Int32, Boolean)
ConcurrentDictionary.Program.Modify() [D:\Source\vc19\ConcurrentDictionary\Program.cs @ 18]
System.Threading.Tasks.Task.Execute()
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
System.Threading.Tasks.Task.ExecuteEntry(Boolean)
System.Threading.Tasks.ThreadPoolTaskScheduler.TryExecuteTaskInline(System.Threading.Tasks.Task, Boolean)
System.Threading.Tasks.TaskScheduler.TryRunInline(System.Threading.Tasks.Task, Boolean)
System.Threading.Tasks.Task.InternalRunSynchronously(System.Threading.Tasks.TaskScheduler, Boolean)
System.Threading.Tasks.Parallel.Invoke(System.Threading.Tasks.ParallelOptions, System.Action[])
ConcurrentDictionary.Program.Main(System.String[]) [D:\Source\vc19\ConcurrentDictionary\Program.cs @ 44]

OS Thread Id: 0x1e68 (6)
Call Site
System.Collections.Generic.Dictionary`2[[System.__Canon, mscorlib],[System.Int32, mscorlib]].FindEntry(System.__Canon)
System.Collections.Generic.Dictionary`2[[System.__Canon, mscorlib],[System.Int32, mscorlib]].ContainsKey(System.__Canon)
ConcurrentDictionary.Program.Modify() [D:\Source\vc19\ConcurrentDictionary\Program.cs @ 17]
System.Threading.Tasks.Task.Execute()
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
System.Threading.Tasks.Task.ExecuteEntry(Boolean)
System.Threading.ThreadPoolWorkQueue.Dispatch()

0:000> !runaway
 User Mode Time
  Thread       Time
    0:2370     0 days 0:00:44.843
    6:1e68     0 days 0:00:44.796
    2:928      0 days 0:00:00.015
    5:233c     0 days 0:00:00.000
    4:3560     0 days 0:00:00.000
    3:92c      0 days 0:00:00.000
    1:1fb8     0 days 0:00:00.000
	
0:000> .time
Debug session time: Tue Aug 10 22:32:25.000 2021 (UTC + 2:00)
System Uptime: 0 days 0:22:07.133
Process Uptime: 0 days 0:00:45.000
  Kernel time: 0 days 0:00:00.000
  User time: 0 days 0:01:29.000

We find that threads 0 and 6 consume a lot of CPU (44 s each until the dump was taken). These threads call into Dictionary.FindEntry and Dictionary.Insert, which is strange. Why should it suddenly take so much more CPU to look up and insert things into a Dictionary?

In the meantime Visual Studio has become pretty good at loading memory dumps as well. By far the best view is still the Parallel Stacks window.

But for now we turn back to Windbg because it shows which threads consume how much CPU: !runaway reports it per thread and .time for the complete process. This helps a lot to focus on the threads which are consuming the most CPU.

To drill deeper into the innards of the Dictionary we switch to the best Windbg extension for managed code, which is netext. Unzip the latest release into the Windbg\winext folder, otherwise you need to fully qualify the path to the x86/x64 version of netext when loading it.

0:000> .load netext
netext version 2.1.64.5000 Jun  2 2021
License and usage can be seen here: !whelp license
Check Latest version: !wupdate
For help, type !whelp (or in WinDBG run: '.browse !whelp')
Questions and Feedback: https://github.com/rodneyviana/netext/issues 
Copyright (c) 2014-2015 Rodney Viana (http://blogs.msdn.com/b/rodneyviana) 
Type: !windex -tree or ~*e!wstack to get started

0:006> ~6s
0:006> !wstack

Listing objects from: 0000007a30afe000 to 0000007a30b00000 from thread: 6 [1e68]

Address          Method Table    Heap Gen      Size Type
@r8=000001eb5a279ef8 00007ffa81d38538   0  0         92 System.Int32[]
@rcx=000001eb5a27ffe8 00007ffa81d4d880   0  0        432 System.Collections.Generic.Dictionary_Entry[]
@rdi=000001eb5a2d69e0 00007ffa81d359c0   0  0         36 System.String Key 1
@rdx=000001eb5a27ffe8 00007ffa81d4d880   0  0        432 System.Collections.Generic.Dictionary_Entry[]
@rsi=000001eb5a272d88 00007ffa81d46db0   0  0         80 System.Collections.Generic.Dictionary
000001eb5a2d69a8 00007ffa81d385a0   0  0         24 System.Int32 0n1
000001eb5a274d48 00007ffa81d3a440   0  0        216 System.Globalization.NumberFormatInfo
000001eb5a2d69c0 00007ffa81d359c0   0  0         28 System.String 1
000001eb5a28e350 00007ffa81d39e00   0  0         48 System.Text.StringBuilder
000001eb5a279a50 00007ffa81d359c0   0  0         40 System.String Key {0}
19 unique object(s) found in 1,716 bytes

0:006> !wdo 000001eb5a27ffe8
Address: 000001eb5a27ffe8
Method Table/Token: 00007ffa81d4d880/200000004 
Class Name: 
System.Collections.Generic.Dictionary_Entry<System.String,System.Int32>[]
Size : 432
Components: 17
[0]: -mt 00007FFA81D4D820 000001EB5A27FFF8
[1]: -mt 00007FFA81D4D820 000001EB5A280010
[2]: -mt 00007FFA81D4D820 000001EB5A280028
[3]: -mt 00007FFA81D4D820 000001EB5A280040
[4]: -mt 00007FFA81D4D820 000001EB5A280058
...
0:006> !wdo -mt 00007ffa81d4d820 000001eb5a280010
System.Collections.Generic.Dictionary_Entry<System.String,System.Int32>
00007ffa81d3a238 System.__Canon+0000    key 000001eb5a2d6980 Key 0
00007ffa81d385a0 System.Int32 +0008     hashCode 6278e7e7
00007ffa81d385a0 System.Int32 +000c     next 1 (0n1)
00007ffa81d385a0 System.Int32 +0010     value 1 (0n1)

The .NET Dictionary stores its data in buckets, where each bucket is a linked list of entries. The nodes are not referenced by pointers but by an index stored in the next field. In the dysfunctional case the entry at array location 1 has next index 1, i.e. it points to itself, which causes the node walk to never end.

The decompiled code shows this
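
For .NET Framework 4.x the relevant loop in FindEntry looks roughly like the following simplified sketch, based on the public reference source (field and helper details are omitted and the exact decompilation may differ):

private int FindEntry(TKey key)
{
    if (buckets != null)
    {
        int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
        // Walk the bucket chain via the next index. A corrupted entry whose next
        // field points to itself keeps this loop spinning forever.
        for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next)
        {
            if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key))
                return i;
        }
    }
    return -1;
}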

The for loop executes i = this.entries[i].next as its increment expression. Since the corrupted entry points to itself, the loop never terminates, which explains the high CPU on two threads.

How do these things usually crop up? When more and more code is made multithreaded and async/await is added to the mix, you will often find high CPU in FindEntry or Insert in a slow or stuck process. Now you know that this does not indicate a huge Dictionary which has become inefficient.

Actually you have found a Dictionary multithreading bug if you are running on .NET 4.x!
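
For context, this kind of corruption needs nothing more than unsynchronized readers and writers on a plain Dictionary. A hedged repro sketch (hypothetical code, not the exact Program.cs from the dump above) could look like this:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class Program
{
    // Shared Dictionary mutated from two tasks without any lock
    static readonly Dictionary<string, int> State = new Dictionary<string, int>();

    static void Modify()
    {
        var rnd = new Random();
        while (true)
        {
            string key = "Key " + rnd.Next(100);
            if (!State.ContainsKey(key))   // unsynchronized read ...
            {
                State[key] = 1;            // ... and write can corrupt the entries array
            }
        }
    }

    static void Main()
    {
        // Two concurrent writers are enough; on .NET 4.x this can end with threads
        // spinning forever in FindEntry/Insert, exactly as seen in the !runaway output.
        Parallel.Invoke(Modify, Modify);
    }
}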

You can create a stack tag for these methods so they show up directly in WPA, flagging them as a potential concurrent Dictionary modification whenever they constantly consume one core or more:

The stacktag file for this is

<?xml version="1.0" encoding="utf-8"?>

<Tag Name=""> 
<Tag Name=".NET">	
<Tag Name="Dictionary FindEntry\If one or more Core is consumed this indicates a concurrent dictionary modification which results in endless loops!">
		<Entrypoint Module="mscorlib.ni.dll" Method="System.Collections.Generic.Dictionary*.FindEntry*"/>
		<Entrypoint Module="mscorlib.dll" Method="System.Collections.Generic.Dictionary*::FindEntry*"/>
		<Entrypoint Module="System.Private.CoreLib.dll" Method="System.Collections.Generic.Dictionary*.FindEntry*"/>
	</Tag>
	<Tag Name="Dictionary Insert\If one or more Core is consumed this indicates a concurrent dictionary modification which results in endless loops!">
		<Entrypoint Module="mscorlib.ni.dll" Method="System.Collections.Generic.Dictionary*.Insert*"/>
		<Entrypoint Module="mscorlib.dll" Method="System.Collections.Generic.Dictionary*::Insert*"/>
		<Entrypoint Module="System.Private.CoreLib.dll" Method="System.Collections.Generic.Dictionary*.Insert*"/>
	</Tag>
</Tag>
</Tag>

This is enough debugging voodoo for today. Drop me a note if you have been bitten by this issue and how you found it.

How To Not Satisfy Customers – Pixel 5 Android Crash

Recently I bought a Pixel 5 which worked perfectly for about a week. While driving on a motorway I tried to connect to some Bluetooth device, which did not work, and a few minutes later the phone rebooted while I was browsing the web. No problem, after a quick reboot it will be right back, I thought. But I was greeted with this (German) error message:

“Cannot load Android system. Your data may be corrupt. If you continue to get this message, you may need to perform a factory data reset and erase all user data stored on this device.”

image

Retry or “Delete Everything”, which is a synonym for Factory Reset, are the only options. That sounds like a bad trade. Several retries later it still refused to boot. In some forum I read that I can copy the OS image back via the debugger interface. Luckily I had enabled Developer Mode on my phone. I first downloaded the wrong version, because Google did not list the OS image I had installed at that point in time (https://developers.google.com/android/ota#redfin) and did not mention that you cannot install an OS image which is older than the current security patch level. I finally got the right image via https://www.getdroidtips.com/february-2021-patch-pixel-device/ onto my phone:

image

Unfortunately that did not work either. There is a secret menu in Android which allows you to view e.g. some logs and offers some other options whose usefulness I cannot judge.

image

Here are some log messages which might or might not be useful to anyone else

Filesystem on /dev/block/by-name/metadata was not cleanly shutdown; state flags: 0x1, incompat feature flags: 0xc6

Unable to set property "ro.boottime.init.fsck.metadata" to "17": error code: 0x18


Unable to enable ext4 casefold on /dev/block/by-name/metadata because /system/bin/tune2fs is missing
google/reding/redfin
11/RQ2A.210305.006/719741
user/release-keys

Android System could not be loaded

At that point I gave up and opened a thread at the Pixel support forum: https://support.google.com/pixelphone/thread/105622202/pixel-5-boot-loop-how-to-recover-data?hl=en. I wanted to save my photos, because I do not trust Google, or any other Cloud provider, to not scan over my entire image gallery. A manual USB sync does the trick and has served me well. The problem is that Android encrypts every sector on the phone, so there is zero chance to read anything unless you have a booted Android system which has unlocked the master key.

I know Bitlocker from Windows which allows me to export my secret keys so I can later access the encrypted data. Perhaps I have missed some tutorial to export my phone encryption keys but it turns out there is no way to export the key! I find it hard to believe that an OS used by billions of people has no data recovery workflow I could follow.

Google wants me to sync everything to the Cloud, or I will lose my data if the OS has a bad day on an otherwise fully functional device. If the phone had an external SD slot the card would not have been encrypted by default (I could enable that if I needed to), which would be ok. But Pixel phones have no SD card slot. I feel thrown back to the 90s where the rule “save early, save often” was good advice to prevent data loss. 30 years and millions of engineering hours later, using an Android device feels like a retro game with a bad user experience if you do not want to share your data with Google.

Unfortunately that is the end of the story. I am pretty good with computers, and having studied particle physics has helped me to analyze and solve many difficult problems. But if a system is designed in a way that locks me out when an error occurs then all bets are off. It is certainly possible to fix these shortcomings, but I guess this is pretty low on Google's business priority list, where the Cloud and AI are the future and it does not matter on which device your data lives. In theory this sounds great, but corner cases, like what happens to unsynced photos/videos when the same error happens, or when you do not trust the Cloud provider, are just minor footnotes on the way to the Cloud.

After a factory reset the phone worked and so far it has not happened again. I think the Cloud is a great thing, but as a customer I should have the choice not to share my data with Google without fearing complete data loss every time the OS detects some inconsistency.

When Things Get Really Bad – NTFS File System Corruption War Story

The setup is a PC which runs software version 1.0. After an upgrade to 2.0 everything did work, but after 1-2 weeks the machine would not start the software v2.0 anymore. When the issue was analyzed it turned out that an xml configuration file was starting with binary data, which prevented v2.0 from starting up.

Good file

<?xml version="1.0" encoding="utf-8"?>
  <import resource="../../Config1.xml"/>    
  <import resource="../../Config2.xml"/>
  <import resource="../../Config3.xml"/>
  <import resource="../../Config4.xml"/>

Bad file

DCMPA30€×ÕÞ±# EÔ°ƒ3¡;k”€
0…\
*d*“B  ...
    <control name="MissileControl" ref="IMissileControl"/>
    <control name="PositionControl" ref="IPositionControl"/>
    <control name="ReservationNotify" ref="IReservationNotify"/>
</config>

The binary data covered the first 4096 bytes of the xml file, which was certainly not good. This raises the question whether other files have been corrupted and whether we can trust that machine at all. That calls for a tool which scans the entire hard drive and calculates the checksum of every file on that machine, so the result can be compared against a good machine with an identical setup. This is not a new problem, so there should be some open source software out there. I found hashdeep, which was developed by Jesse Kornblum, an agent of some government agency. It was exactly what I needed. You can recursively generate a list of files with their hashes from a good machine and then, in audit mode, compare these hashes on another machine and print the missing/new/changed files as diff output.

The audit file generation and file verification are straightforward:

hashdeep -r dir  > /tmp/auditfile            # Generate the audit file
hashdeep -a -k /tmp/auditfile -r dir         # test the audit

The current version seems to be 4.4 and can be downloaded from Github. If you are curious here are the supported options:

C:\> hashdeep64.exe [OPTION]... [FILES]...
-c <alg1,[alg2]> - Compute hashes only. Defaults are MD5 and SHA-256
                   legal values: md5,sha1,sha256,tiger,whirlpool,
-p <size> - piecewise mode. Files are broken into blocks for hashing
-r        - recursive mode. All subdirectories are traversed
-d        - output in DFXML (Digital Forensics XML)
-k <file> - add a file of known hashes
-a        - audit mode. Validates FILES against known hashes. Requires -k
-m        - matching mode. Requires -k
-x        - negative matching mode. Requires -k
-w        - in -m mode, displays which known file was matched
-M and -X act like -m and -x, but display hashes of matching files
-e        - compute estimated time remaining for each file
-s        - silent mode. Suppress all error messages
-b        - prints only the bare name of files; all path information is omitted
-l        - print relative paths for filenames
-i/-I     - only process files smaller than the given threshold
-o        - only process certain types of files. See README/manpage
-v        - verbose mode. Use again to be more verbose
-d        - output in DFXML; -W FILE - write to FILE.
-j <num>  - use num threads (default 8)

After checking all files I found no other corruptions. This is really strange. Why did the xml file not become corrupt during the upgrade but 2 weeks later? The config file is read only and should never be written by anybody. It could be a simple glitch in some memory bank of the SSD, but since the exact same error did happen on several machines we can rule out cosmic rays as the source of our problem. But can radiation be a problem at all? Intel has tried it out by putting an SSD into the beam of a particle accelerator. With enough radiation in the game anything can happen, but it will happen at random locations.

Could it be a hard power off while the system was writing data to the disk? Even if the OS does everything right, the SSD might not have enough power left to store the data. But that theory could also be ruled out, because the SSD of that machine has a built-in capacitor which ensures that all pending writes are safely written even during a hard power off event.

Once you have thought of enough nice theories you need to get more evidence (read: data). In this case checking the Windows Event Logs could help to find some clues. A first hint was that after the machine was bricked the Application event log showed that Defrag had been run. Why on earth is Defrag running on an SSD? That should not be needed, since there is no need to realign our data to optimize disk seek times. It turned out that Defrag was reenabled by accident during the update to v2.0.

image

The default setting was that it would run every day late in the evening. Windows is smart enough not to defrag an SSD. Instead it triggers SSD Trim to let the SSD do its own reorganization. But that is not all. Although Defrag is not performing a typical defrag operation, it still defragments directories and other NTFS tables. NTFS becomes dead slow when you store ~100K files in a single directory, because then you hit hard limits of the NTFS file system structures. These internal lists need some cleanup when many file add/remove operations happen in some hot directories, which could otherwise lead after some time to a slow or broken file system. To prevent that, Defrag still does some housekeeping and moves things around even on SSDs.

Since not all machines are running until 9 PM, this also explains why the machines broke seemingly at random within two weeks after the software update. Once Defrag had run and the machine was shut down, the next morning when the machine was turned on it would refuse to work due to the corrupted xml configuration file.

That is a nice finding, but it still does not explain why Defrag would ruin our xml file. Lets check the system event log for interesting events. With interesting I mean messages like these:

System Event Log Messages

Id    Source                          Meaning                 Example Message
6005  EventLog                        System Start            The Event log service was started.
6006  EventLog                        System Stop             The Event log service was stopped.
6008  EventLog                        Hard Reboot             The previous system shutdown at xxxx AM on dd/dd/dddd was unexpected.
41    Microsoft-Windows-Kernel-Power  Hard Reboot             The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
109   Microsoft-Windows-Kernel-Power  Reboot initiated by OS  The kernel power manager has initiated a shutdown transition
131   Ntfs                            NTFS Corruption         The file system structure on volume C: cannot be corrected.
130   Ntfs                            NTFS Repair (chkdsk)    The file system structure on volume C: has now been repaired.

Application Event Log Messages

Id     Source                    Meaning        Example Message
258    Microsoft-Windows-Defrag  Defrag Run     The storage optimizer successfully completed defragmentation on System (C:)
258    Microsoft-Windows-Defrag  Defrag Run     The storage optimizer successfully completed retrim on System (C:)
26228  Chkdsk                    Chkdsk Repair  Chkdsk was executed in verify mode on a volume snapshot.

 

The next clue was that after Defrag had been run the NTFS volume needed a repair.

Chkdsk        26228        None        "Chkdsk was executed in verify mode on a volume snapshot.

Checking file system on \Device\HarddiskVolume2

Volume label is System.

Examining 1 corruption record ...

Record 1 of 1: Bad index ""$I30"" in directory ""\Users\xxxx\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\System Tools <0x4,0x15437>"" ... The multi-sector header signature for VCN 0x0 of index $I30

in file 0x15437 is incorrect.

44 43 4d 01 50 41 33 30 ?? ?? ?? ?? ?? ?? ?? ?? DCM.PA30........

corruption found.

That is a helpful message! After Defrag has run, NTFS is broken. This NTFS corruption is then fixed by chkdsk, which complains about some start menu entry. Luckily chkdsk also prints the first bytes of the corrupted file, which starts with DCM.PA30, the same bytes as in our corrupted xml file! I extracted the binary contents of the broken xml file and put them into an extra file. Then I ran hashdeep.exe over it to check whether this binary data is the content of some other file. Since I had only the first 4K of the data I was not hopeful to find something, but lucky me, I did find a file which matched. It looks like we did mix the contents of

c:\Windows\WinSxS\Manifests\amd64_windows-defender-management-onecore_…_none_….manifest

with our xml file and the start menu entry shortcut! The question remains what Windows Defender files have to do with our xml and the start menu entry. Lets check the System Event Log for interesting events:

image

Now we can replay the events that led to the NTFS corruption:

  1. OS is installed
  2. Device Guard is enabled (not shown in the logs)
  3. Windows shuts itself down (“The kernel power manager has initiated a shutdown transition”)
  4. Sometimes the system hangs and the person installing the machine presses the power button to perform a hard reset
    • This caused the initial NTFS corruption
  5. After boot NTFS detects that something is corrupt and repairs it, although one of the messages tells us that it cannot be corrected
    1. Volume C … needs to be taken offline to perform a Full Chkdsk
    2. A corruption was discovered in the file system structure on volume
    3. The file system structure on volume C: cannot be corrected
    4. The file system structure on volume C: has now been repaired
  6. Now comes our part: after the automatic repair, software v1.0 is installed, which writes data onto an ill-repaired NTFS volume and continues to run happily until …
  7. v2.0 of the software is installed which enables defrag by accident again
  8. The software version 2.0 runs happily it seems …
  9. But when the machine was running at 9:00 PM in the evening, Defrag was executed. It defragmented NTFS tables using referential information from other (ill repaired) tables, which destroyed the first 4K of our xml file, and we ended up with the contents of some Windows Defender file.
  10. The next morning the machine would refuse to work
  11. Problems ….

That was an interesting issue and as you can guess this analysis was not done by me alone but several people contributed to connect all the loose ends. Thank You! You know who you are.

What can we learn from this?

  • Log messages are really important
  • Keeping all of them is essential to find the root cause
    • Overwriting the event log with myriads of messages is not the brightest idea
  • Luck: We got lucky that the System Event Log history included the initial OS installation

Hard Power Off

This is another case where transactions do not give you an always consistent file system. What else can happen? Lets try it out by writing to some log files and then powering off the machine while it is writing data to disk.

I used this simple C# program to write to several files in an alternating manner. I am not sure if that is needed, but it seems to help force the issue sooner:

using System.Collections.Generic;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        List<Stream> streams = new List<Stream>();
        const int Files = 5;

        for (int i = 0; i < Files; i++)
        {
            string fileName = $"C:\\temp\\TestCorruption{i}.txt";
            FileStream stream = new FileStream(fileName, FileMode.OpenOrCreate, FileAccess.Write, FileShare.ReadWrite);
            streams.Add(stream);
        }

        while (true)
        {
            for (int i = 0; i < Files; i++)
            {
                using StreamWriter writer = new StreamWriter(streams[i], leaveOpen:true);
                writer.WriteLine("Hello this is string 1");
                writer.Flush();
                writer.Write("Hello this is string 2");
                writer.Flush();
            }
        }
    }
}

We write the string “Hello this is string 1” to 5 log files, flush the data to force a WriteFile call to the OS so that our process has no pending buffered data, then write another string and force WriteFile again. If the process crashes we still get consistent data in our log files, because the OS already has the data and will write the pending data to disk some time later. But what will happen when we do a hard power off?

Lets look at our log file

Hello this is string 2Hello this is string1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this isNULNULNULNULNULNULNULNULNULNUL....

Who wrote the 0 bytes to our text file? We wrote only ASCII data to our files! When WriteFile("Buffered data") is executed and the OS wants to write that data to disk, the OS needs to

  1. Increase the file size in the directory
  2. Write the buffered data over the now nulled cluster

Steps 1 and 2 are related, but a hard power off can let 1 happen but not 2. This can result in a text file which contains larger amounts of 0 bytes at the end for no apparent reason. See https://stackoverflow.com/questions/10913632/streamwriter-writing-nul-characters-at-the-end-of-the-file-when-system-shutdown where someone with more NTFS know-how than me answers that question:

That happens. When you append a file first its size is corrected in the directory (and that’s transactional in NTFS) and then the actual new data is written. There’s good chance that if you shut down the system you end up with a file appended with lots of null bytes because data writes are not transactional unlike metadata (file size) writes.

There’s no absolute solution to this problem.

answered Jun 6 ’12 by sharptooth

When a cluster is allocated it is nulled out to ensure that no previous data can leak through. While you append data to a file you can lose the last write even though the file size already contains the updated value, which makes the 0s visible in a text editor. That is a nasty issue if you expect that nobody will ever hard reset a machine. I am not aware of a way around it, except that you need to be prepared that some text files might contain 0 bytes at the end after a hard power off.
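
If you have to parse such files, one pragmatic mitigation is to tolerate the nulled tail when reading. A minimal sketch (encoding and file name are assumptions):

using System;
using System.IO;
using System.Text;

class NulTolerantLogReader
{
    static string ReadWithoutTrailingNuls(string fileName)
    {
        byte[] data = File.ReadAllBytes(fileName);
        int end = data.Length;
        while (end > 0 && data[end - 1] == 0)   // skip the zeroed cluster tail left by the power cut
        {
            end--;
        }
        return Encoding.UTF8.GetString(data, 0, end);
    }

    static void Main()
    {
        Console.WriteLine(ReadWithoutTrailingNuls(@"C:\temp\TestCorruption0.txt"));
    }
}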

Was the file system corrupted this time again? What does the System event log tell us?

Critical    9/5/2020 5:40:52 PM    Microsoft-Windows-Kernel-Power    41    (63)    The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Information    9/5/2020 5:40:52 PM    Microsoft-Windows-Ntfs    98    None    Volume \\?\Volume{12a882c8-5384-11e3-8255-806e6f6e6963} (\Device\HarddiskVolume2) is healthy.  No action is needed.

NTFS says everything is green. From a file system integrity perspective everything did work until power was cut. Your hard disk missed the last write, but hey: who guaranteed you that this is an atomic operation?

Conclusions

Even impossible errors have an explanation if you are willing to dig deep enough. What was the nastiest error for which you managed to find the root cause?

MS Performance HUD – Analyze EventLog Reading Performance In Realtime

This post explains why your home machine can work fast while many corporate PCs show high CPU consumption for several minutes at random times, especially after boot, with no apparent explanation. Event Tracing for Windows (ETW) is a fantastic piece of technology. I love it because it enables me, and perhaps you, to tackle systemic issues regardless of whether the problem is my application, an Explorer plugin, a slow device driver or your antivirus scanner gone wild. Last time I complained about slow WMI queries (https://aloiskraus.wordpress.com/2019/11/17/the-wmi-query-from-hell/) which were induced by some system management software. It turns out that reading from the Windows Event Log is a quite complex undertaking when you try to understand its slow read performance. For starters I try to read all Application event log messages via a WMI query:

 

Slow Event Log Read Performance

SELECT * FROM Win32_NTLogEvent WHERE Logfile = 'Application'
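
The same query can also be issued and timed from C#. A minimal sketch (assuming a reference to System.Management; the actual test program is not shown here):

using System;
using System.Diagnostics;
using System.Management;

class WmiEventLogTiming
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();
        int count = 0;
        var searcher = new ManagementObjectSearcher(
            "SELECT * FROM Win32_NTLogEvent WHERE Logfile = 'Application'");
        foreach (ManagementBaseObject evt in searcher.Get())
        {
            count++;   // enumerating forces the event log service to read and format every message
        }
        Console.WriteLine($"Read {count} events in {sw.Elapsed.TotalSeconds:F1} s");
    }
}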

The query runs ca. 35 s, which is quite a while. To put the number into perspective lets check the size of the event log: it is 25 MB and stores 60,232 messages. This means we are reading a binary file at 0.7 MB/s on my i7-4770K CPU @ 3.50GHz. That is really slow! Lets record some data with ETW and analyze it with the new Performance HUD tool which was released on May 29 2020. When I select the event log service process which shows high CPU I can easily find why it is so busy:

image

We are reading 60K messages. While we are doing that, the event log service process accesses the Registry over 800K times to find the currently configured language! The method EvtRpcMessageRender formats an event log message, which in turn calls FormatMessage, which does not reuse any cached thread culture but always reads the data from the registry. This means we have 13 registry accesses for every read message, which is way too much. The Performance HUD shows performance insights in the lower grid along with a tooltip telling you what you can do about it. The performance of FormatMessage seems to be less important than ensuring 100% correctness when the language is switched, because it must use the updated value instantaneously. The HUD tool makes ETW data much more accessible than it was before. If I want to know how much data the event log service has read I just need to switch over to the File I/O tab, which shows me the 25 MB parsed Application event log. If I need it I can also get the call stack by hovering over the blue underlined text and pressing Ctrl+Click to find the corresponding stack traces.

image

Now we have proven that the event log service reads 25 MB of data, but to format the read data it needs 800K registry reads just to determine the current language, which costs in total over 35 s of execution time. At least that is what I can see with the HUD tool inside a single process. This tool is definitely not a v1.0 and, judging by the many helpful tooltips from performance experts, has been used within MS for quite some time. The HUD tool is a great addition to the toolbox which warrants more detailed posts in the future. If you have problems getting it running drop me a note and I will see if I can help.

Performance HUD can be run in two modes

  • Realtime attached to a specific process
  • Offline while parsing data from an existing recorded ETL file
    • This requires CPU Profiling, Context Switch, File IO, Registry tracing with stacks enabled
    • For UI Hangs and User/GDI Handle leaks you need additionally to record the Microsoft-Windows-Win32k ETW Provider with 0x8000030000240000 flags + Stacks
      • The new GDI/User Leak tracking flags are (Win 10 1903. Windows Server 2019 e.g. is based on Win 10 1807. No server OS or Win 10 LTS release supports that yet …)
        • 0x0000010000000000  GdiHandleOperation
        • 0x0000020000000000  UserHandleOperation

Realtime ETW tracing is a great thing which makes it much easier for the average performance investigator to analyze ETW data right away. Especially UI thread hangs due to blocking calls got fantastic support which makes finding your UI hangs a no brainer.

 

System Wide Analysis

A first check is great to do in HUD but a more detailed analysis is still best done in WPA (and a 4K monitor). Lets see how the CPU consumption across processes is spent:

 

image

image

The WMI Query is executed from my test process named EventLogFlusherAndReader.exe (14796) which consumes 10s mostly reading registry keys for the COM class factory of some WMI COM objects. …

Next is svchost(1676), which is the Event Log service process, also with 10 s of CPU. As we have seen earlier, it also spends most of its time reading the registry; in our case it is 5 s out of 10 s. Below is a picture of how a WMI event log query is actually executed:

 

image

WMI can do much, but only with the help of many lackeys which work for us. We are reading a binary file and decoding its messages, but with WMI this causes significant CPU in 4 other system processes! Most time is spent inside the event log service, the WMI worker process and our WMI reading process. I was wondering why the EventLogReader process consumes so much CPU when it should only receive the data from WMI. It turns out there is more behind it, because the COM objects which are used are registered not only for WMI but also for VirtualBox. It seems the VM host intercepts quite a few calls on my machine. I uninstalled VirtualBox (this was quite a journey because I have moved the C:\Windows\Installer folder to a different drive, which breaks many things. Note to self: never do this again!) and checked again. Sure enough the additional registry queries were gone, but the total runtime did not change much. The reason is that the Performance HUD tool enables expensive ETW logging events for every registry access, which exaggerates the cost of registry accesses quite a bit. With HUD profiling enabled I did see application slowdowns of up to 40%. You should therefore still measure without Performance HUD enabled to get a less distorted picture of what is really our bottleneck.

So what can we do about slow event log reading performance with WMI? As it turns out, not much. MS should really check whether it needs to access the language registry keys so often without any caching. In theory we can format the messages ourselves, but this is only part of the story, since ETW supports up to 3 resource dlls for an ETW event log message:

image

 

  1. MessageFileName: Same as the old event log message table dlls. It contains the localized resource string where the parameters are expanded at %1 %2, … just like for classic events.
  2. ParameterMessageFile: This contains constants such as Access Denied, Granted, … which are referenced with two %% followed by the resource number, e.g. %%8100.
  3. ResourceFileName: Defines the WEVT_TEMPLATE resource which contains event names, task names etc. inside a binary structure.

What will happen when I format messages the old way, by loading the resource dll and expanding the parameters myself? For an ETW provider which plays nice and defines only resource based strings in the event log like this

image

I can correctly decode the message with the old EventLog API of .NET:

image
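
A minimal sketch of that legacy read path (hedged; the real EventLogFlusherAndReader code may differ):

using System;
using System.Diagnostics;

class OldApiReader
{
    static void Main()
    {
        using var log = new EventLog("Application");
        foreach (EventLogEntry entry in log.Entries)
        {
            // entry.Message resolves the text in-process via the EventMessageFile message table dll
            Console.WriteLine($"{entry.TimeGenerated} {entry.Source}: {entry.Message}");
        }
    }
}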

But when I remove the EventMessageFile registry key I get an unresolved message, just like for any “modern” ETW provider which declares itself in the event log without being nice to legacy APIs: logMsg.Message errors out and gives me this message:

The description for Event ID ‘xxx’ in Source ‘xxxxx’ cannot be found. The local computer may not have the necessary registry information or message DLL files to display the message, or you may not have permission to access them.

Since Windows Vista event log providers only need a ProviderGuid below the event log registry key, which makes it impossible to format resource based messages without the newer ETW event rendering APIs.

This is an issue if you are monitoring log messages issued by low level device drivers to report e.g. disk failures and other hardware issues. These important messages will slip through your monitoring system if you use the .NET System.Diagnostics.EventLog class. It is the fastest way to read messages, but unfortunately you will miss some of them. The situation can be fixed to some extent by creating an EventMessageFile registry value for the ETW provider below the corresponding event log source registry node. You need to look up the ProviderGuid below Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\WINEVT\Publishers\ to find the corresponding MessageFileName. That will enable proper rendering of ETW based providers.

 

image
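
A hedged sketch of how such a mapping could be created programmatically (provider name and dll path are made up; you need to look up the real MessageFileName below WINEVT\Publishers\{ProviderGuid} first, and this needs admin rights plus the Microsoft.Win32.Registry package on .NET Core):

using Microsoft.Win32;

class AddEventMessageFile
{
    static void Main()
    {
        const string source = "MyDriverProvider";   // hypothetical event source name
        using RegistryKey key = Registry.LocalMachine.CreateSubKey(
            $@"SYSTEM\CurrentControlSet\Services\EventLog\System\{source}");
        key.SetValue("EventMessageFile",
                     @"%SystemRoot%\System32\drivers\MyDriverMessages.dll",   // hypothetical resource dll
                     RegistryValueKind.ExpandString);
    }
}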

 

There is one caveat though. Fixed string constants which are referenced via %% in the resource based messages and backed by the ParameterMessageFile resource dll will not be expanded and will stay as placeholders in your formatted message. So far I have mostly seen security related ETW providers use that more advanced string expansion machinery.

The best option is still to use the newer .NET API in the form of System.Diagnostics.Eventing.Reader.EventLogReader, which will properly format all messages. The only issue is that this API, just like WMI, calls into the event log service where the message formatting is done out of process, which is much slower. I might be obsessed with performance when I complain about newer but more powerful APIs. Lets check things by reading and formatting 10K messages from the Application and Security event logs with the old System.Diagnostics.EventLog (readoldapi) and the newer System.Diagnostics.Eventing.Reader.EventLogReader (read) API.
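
For comparison, the newer read path looks roughly like this (minimal sketch; the real test code may differ):

using System;
using System.Diagnostics.Eventing.Reader;

class NewApiReader
{
    static void Main()
    {
        var query = new EventLogQuery("Application", PathType.LogName);
        using var reader = new EventLogReader(query);
        for (EventRecord record = reader.ReadEvent(); record != null; record = reader.ReadEvent())
        {
            using (record)
            {
                // FormatDescription() round trips to the event log service for every message
                Console.WriteLine($"{record.TimeCreated} {record.ProviderName}: {record.FormatDescription()}");
            }
        }
    }
}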

C>EventLogFlusherAndReader.exe read -n 10000 -p 1 -l Application
Did read 10000 messages from log Application in 0.10s, 97,749 Messages/s, Format Repeat Count: 1
Total Formatting time 10.4 s.

C>EventLogFlusherAndReader.exe readoldapi -f -n 10000 -l Application
Read up to 10000 entries from Application log
Did read 10000 message table based event log messages from log Application in 1.45s, 6,899 Messages/s
Formatting time 0.4 s.
Reading and formatting for 10000 did take 1.85s, 5,410 Messages/s

 

C>EventLogFlusherAndReader.exe read -n 10000 -p 1 -l Security
Did read 10000 messages from log Security in 0.12s, 82,068 Messages/s, Format Repeat Count: 1
Total Formatting time 129.4 s.

C>EventLogFlusherAndReader.exe readoldapi -f -n 10000 -l Security
Read up to 10000 entries from Security log
Did read 10000 message table based event log messages from log Security in 1.37s, 7,277 Messages/s
Formatting time 0.6 s.
Reading and formatting for 10000 did take 1.98s, 5,041 Messages/s

 

With the “new” API we are over 20 times slower because formatting is now done in the event log service which, like WMI, causes a lot of overhead. What stands out is that reading data from the Security event log is > 200 times slower compared to the “old” EventLog API. Since .NET is open source it is easy to debug this, and I have opened an issue https://github.com/dotnet/runtime/issues/34568 with a request to fix it. As usual there is little motivation to change such old legacy code, but I was invited to provide a PR, which I did. After a bit of back and forth due to test failures I landed this single line fix: https://github.com/dotnet/runtime/pull/36188/commits/dd0e212d75218f3b0eef3bc1ab812d7eb763279e which removes a superfluous check that did not serve any real purpose. The underlying issue was that for every message format call a check was done which formatted all task names. The Security event log defines 60 task names, which explains its bad parsing performance. By removing a single line it got 26 times faster, which is not bad. This change will light up in .NET 5.0, so stay tuned. In the meantime you can recompile the event log reading dll from the .NET sources on your own and use it from a regular .NET application without any issues. It is even easy to copy the source code from .NET Core and put it into your own project. Here are the results:

 

C>EventLogFlusherAndReader.exe read -n 10000 -p 1 -l Application
Did read 10000 messages from log Application in 0.10s, 99,245 Messages/s, Format Repeat Count: 1
Total Formatting time 2.7 s.

 

C>EventLogFlusherAndReader.exe read -n 10000 -p 1 -l Security
Did read 10000 messages from log Security in 0.12s, 80,601 Messages/s, Format Repeat Count: 1
Total Formatting time 4.3 s.

 

That speeds up reading event logs with .NET by 3-30 times, depending on how many task names the log you are parsing has defined. It is still ca. 7 times slower compared to the old API, which might be ok if you are not overdoing things.

Conclusions

The Security event log can be much bigger than 20 MB, because some corporate security departments follow the MS security guidance to audit everything, which requires a huge 256 MB or even 1 GB Security event log file. If it were much smaller it would overwrite the oldest events within hours, depending on how your audit policies are configured. To check your current audit policies you can run auditpol.exe. On my home machine everything looks good, but in a corporate setup things might be different. Especially Success audit messages can fill up your log files …

C>auditpol /get /category:*
System audit policy
Category/Subcategory                      Setting
System
  Security System Extension               No Auditing
  System Integrity                        Success and Failure
  IPsec Driver                            No Auditing
  Other System Events                     Success and Failure
  Security State Change                   Success
Logon/Logoff
  Logon                                   Success and Failure
  Logoff                                  Success
  Account Lockout                         Success
  IPsec Main Mode                         No Auditing
  IPsec Quick Mode                        No Auditing
  IPsec Extended Mode                     No Auditing
  Special Logon                           Success
  Other Logon/Logoff Events               No Auditing
  Network Policy Server                   Success and Failure
  User / Device Claims                    No Auditing
  Group Membership                        No Auditing
Object Access
  File System                             No Auditing
  Registry                                No Auditing
  Kernel Object                           No Auditing
  SAM                                     No Auditing
  Certification Services                  No Auditing
  Application Generated                   No Auditing
  Handle Manipulation                     No Auditing
  File Share                              No Auditing
  Filtering Platform Packet Drop          No Auditing
  Filtering Platform Connection           No Auditing
  Other Object Access Events              No Auditing
  Detailed File Share                     No Auditing
  Removable Storage                       No Auditing
  Central Policy Staging                  No Auditing
Privilege Use
  Non Sensitive Privilege Use             No Auditing
  Other Privilege Use Events              No Auditing
  Sensitive Privilege Use                 No Auditing
Detailed Tracking
  Process Creation                        No Auditing
  Process Termination                     No Auditing
  DPAPI Activity                          No Auditing
  RPC Events                              No Auditing
  Plug and Play Events                    No Auditing
  Token Right Adjusted Events             No Auditing
Policy Change
  Audit Policy Change                     Success
  Authentication Policy Change            Success
  Authorization Policy Change             No Auditing
  MPSSVC Rule-Level Policy Change         No Auditing
  Filtering Platform Policy Change        No Auditing
  Other Policy Change Events              No Auditing
Account Management
  Computer Account Management             No Auditing
  Security Group Management               Success
  Distribution Group Management           No Auditing
  Application Group Management            No Auditing
  Other Account Management Events         No Auditing
  User Account Management                 Success
DS Access
  Directory Service Access                No Auditing
  Directory Service Changes               No Auditing
  Directory Service Replication           No Auditing
  Detailed Directory Service Replication  No Auditing
Account Logon
  Kerberos Service Ticket Operations      No Auditing
  Other Account Logon Events              No Auditing
  Kerberos Authentication Service         No Auditing
  Credential Validation                   No Auditing

 

Reading a big event log with **** Some State Of The Art Security Solution **** to check for malicious activity might kill the interactive user experience, since you are now maxing out two cores (the reader process and the event log service formatting your messages) for many minutes. A 1 GB Security event log file would need ca. 10 minutes to read even with the fixed .NET version. The current way .NET handles event log messages would make it scan through a 1 GB Security event log file for over 5 h. I hope such a security “solution” is not in production anywhere, but one never knows for sure whether some admin has created a nice powershell script which uses Get-WinEvent to read all messages with a non-deprecated, up-to-date API in his Active Directory forest. This post exists to make more people aware of the performance issues related to event log reading and tight security auditing policies. If you see minutes of high CPU spent in the svchost.exe (a generic service host) instance which hosts the event log service, you can be pretty sure that someone is reading large event log files on your machine via powershell, WMI, or other ways.

image

Most often this is some security related software which is not fully optimized yet. Lets not give up, but improve the tools we use and check whether the policies which generate most log messages are really needed. If not, you can reduce the size of the log and still keep the rollover time for the oldest log messages constant.

With Performance HUD it should be much easier to come to the same conclusions as I did. Try out the new tool and tell me what you have found! And lets hope that someone makes FormatMessage faster.

The WMI Query From Hell

This is about a case where I wondered why my Windows installation at home is running fast while the Windows machine inside a company network is so slow, especially during boot. “Nothing exists independently“, a quote from the Dalai Lama, applies very much to performance engineering. I am not sure if he had that in mind, but be prepared for unexpected twists during the following posts.

First a little background on what you can expect in a corporate Windows network. Normally you have an Active Directory installation which takes care of user management. This involves rolling out central Group Policies defined by your administrators, deploying software centrally with e.g. MS System Center, and rolling out virus scanners and virus pattern updates. This happens in the background and you will not notice anything, except that sometimes stuff is downloaded and installed and you are asked to reboot because some newly installed software or Windows update needs one.

After a reboot I noticed that my machine was always working feverishly on at least 2 cores for the next 5 minutes. That is not good. Task Manager was showing up to over 50% CPU utilization on my 4 core notebook, which means that I could barely work when something demanding was needed. The natural reaction of anyone who is dealing with performance issues on Windows is to turn on ETW tracing. Why? Because the recording tool WPR is part of every Windows 10 installation. If you want to record for longer than one minute I would set the sampling rate down to 10 ms to not get too much data, like this:

C>wpr -start GeneralProfile -start CPU -start DiskIO -start FileIO -setprofint 100000

 

   Reproduce Issue

 

C>wpr -stop c:\temp\PerfIssue.etl

The CPU profile of WPR will always record context switch events, which is interesting but produces a lot of data. I am still not sold on wpr as the successor to xperf, because you need to extract the builtin profiles from the exe, create your own derivative profile and then save it to a file, which is error prone and not easy to tailor. But for short recording durations it is okay to record pretty much everything, so I do not complain. When you want to analyze the data you still need to download the Windows Performance Toolkit (WPT) from the Windows 10 SDK. Then you can apply the OverView.wpaProfile profile from ETWController which comes with many useful stacktags, located here: https://github.com/Alois-xx/etwcontroller/tree/master/ETWController/ETW (get the complete folder).

image

We see large CPU activity in WmiPrvSE.exe, svchost.exe and lsass.exe. What is this all about? On a live machine you can use ProcessHacker (the current version is 3.0.2824; do not download the old version 2.xxx from 2016) to easily see which processes are consuming CPU. A very useful but little known feature is that you can filter by multiple processes separated with | in the input window on the toolbar. It filters on all visible columns.

image

When you hover over the processes you immediately see which service or WMI provider is hosted in this process which can be very helpful.

image

WmiPrvSE.exe is a WMI provider host process where the WMI provider objects live. Whenever you call into a WMI object the call ends up in one of the several provider processes. The WMI Providers list tells us that this process contains the Event Log provider. Lets turn to the next most CPU consuming process:

image

This is the Event Log service process. Looks like the WMI provider calls into the Windows Event Log Service. Interesting.

image

lsass.exe is the process which manages all security related things inside Windows. This one tells you whether you get access denied or not.

image

The last one is a Windows service host which hosts the WMI service. Those are a lot of different processes consuming CPU. But why? It looks like some WMI activity is dealing with the event log, but how can we find out which WMI query this could be?

The brute force approach is to check the memory contents of the WmiPrvSE.exe process and search for WMI queries which contain e.g. the select keyword. ProcessHacker can actually scan for strings in any process. Lets do that as a quick investigation:

image

image

And behold we immediately get a nice list of current and past queries executed by WMI:

image

How can we verify which WMI query was so slow? We do not need to write a C++/C# application. Instead we use Powershell, which can execute any WMI query with the Get-WmiObject -query "select …" cmdlet. Lets query all Windows event logs for errors (eventtype = 1) and measure the duration with Measure-Command:

PS C:\>Measure-Command -Expression { Get-WmiObject -query "select Eventcode,Eventtype,Logfile,Message,Sourcename,TimeGenerated,user from Win32_NTLogEvent where eventtype = 1"  }
                                                                                                                                                                                 
Minutes           : 4                                                                                                                                                            
Seconds           : 5                                                                                                                                                            

Looks like we have found it. What is so slow about this WMI query? The first observation is that the original query found in memory contains a time filter which tries to get events from the last 3 days. Normally you want to query a specific event log, but this query reads all error (eventtype = 1) messages from all event logs on the system.
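
A hedged reconstruction of what such a time filtered query looks like (not Lansweeper's exact text; WMI expects DMTF formatted timestamps, so the 3 day window has to be converted first):

using System;
using System.Management;

class TimeFilteredEventQuery
{
    static void Main()
    {
        string since = ManagementDateTimeConverter.ToDmtfDateTime(DateTime.Now.AddDays(-3));
        string wql = "SELECT Eventcode, Eventtype, Logfile, Message, Sourcename, TimeGenerated, User " +
                     $"FROM Win32_NTLogEvent WHERE EventType = 1 AND TimeGenerated > '{since}'";
        int count = 0;
        foreach (ManagementBaseObject evt in new ManagementObjectSearcher(wql).Get())
        {
            count++;   // every row still makes the event log service read and format the message
        }
        Console.WriteLine($"{count} error events in the last 3 days");
    }
}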

Up to now we have identified an expensive WMI Query which reads all error events from the last 3 days. When you have a performance problem you have two options now:

  1. Find out who did call the query and prevent it from running
  2. Identify the root cause of the bad WMI Event Log Query performance

Lets start with option 1. Since Windows 8.1 we can enable WMI Activity Tracing and query for the WMI Execution events.

To enable WMI Tracing execute this on an elevated command prompt

                                                              
C>wevtutil sl Microsoft-Windows-WMI-Activity/Trace /e:true    

To query for WMI Query execution events we need to filter for event 11 which contains also the initiating process id

                                                                                                              
C>wevtutil qe Microsoft-Windows-WMI-Activity/Trace /q:"*[System[EventID=11]]" /f:text | findstr /i ExecQuery  
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 42920; OperationId = 42932; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT * FROM Win32_PhysicalMedia WHERE Tag="\\\\.\\PHYSICALDRIVE1"; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 42992; OperationId = 42993; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT AdapterRAM FROM Win32_VideoController WHERE AdapterRAM >= 2147483648; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 42996; OperationId = 42997; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT AdapterRAM FROM Win32_VideoController WHERE AdapterRAM >= 1073741824; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 43000; OperationId = 43001; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT AdapterRAM FROM Win32_VideoController WHERE AdapterRAM >= 4293918720; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 43046; OperationId = 43047; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT AdapterRAM FROM Win32_VideoController WHERE AdapterRAM >= 2147483648; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 43050; OperationId = 43051; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT AdapterRAM FROM Win32_VideoController WHERE AdapterRAM >= 1073741824; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 43054; OperationId = 43055; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT AdapterRAM FROM Win32_VideoController WHERE AdapterRAM >= 4293918720; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {C9FD2187-9C4E-0004-3FBF-FEC94E9CD501}; GroupOperationId = 43058; OperationId = 43059; Operation = Start IWbemServices::ExecQuery - root\cimv2 : select Eventcode,Eventtype,Logfile,Message,Sourcename,TimeGenerated,user from Win32_NTLogEvent where (type = 2 ) ; ClientMachine = SKYRMION; User = Skyrmion\Alois; ClientProcessId = 1112; NamespaceName = 132183970163500741

With the process ID it is trivial to identify which process did initiate the WMI Query. Who is it?

image

lspush.exe? Never heard of it. This one is not from Microsoft:

image

A search later we get to https://www.lansweeper.com/knowledgebase/how-to-scan-with-lspush/. Lansweeper is a network inventory tool which gathers client machine information on Windows, usually by calling lspush at boot or login via a group policy. There is a free edition which I installed at home to reproduce the issue. And lspush did show the same behavior on my machine:

image

It is of course much easier to spot the issue when you see the Lansweeper UI, where the scan operation is even visible. In corporate networks such things normally run in the background and you need to resort to advanced tooling to drill deeper. I contacted Lansweeper and asked whether there is something I could do about it. The answer was that other customers have complained as well, but the issue will only be tackled when enough customers complain. Currently there is NO way to turn this offensive event log query off, which is frustrating. I know Lansweeper has many customers, but not many seem to have been able to correlate lspush induced CPU activity in WMI related processes with Lansweeper yet. If you are a Lansweeper customer and you are using lspush and not the agent approach to scan your machines, you should definitely reach out to their support to raise the priority of the backlog item. Lansweeper is used by many big companies (https://www.lansweeper.com/) where definitely more than one will experience this issue. When I addressed this with our IT guys the first time they did not believe me, because the Lansweeper UI also shows how long the scan took:

image

According to the Lansweeper UI my scan time is in the sub-second region. What this report fails to show are the scans which cannot be turned off and how long those took …

In the meantime, if you are a Lansweeper customer, measure how long lspush.exe executes in your setup and call their support to prevent wasted CPU cycles at scale and a bad user experience after logon. This issue is not happening everywhere, which brings me back to point 2: why is this event log query so slow? That is the topic of the next post.