PDD Profiler Driven Development

There are many extreme software development strategies out there. PDD can mean Panic Driven, or Performance Driven Development. The first one should be avoided, the second is a MSR paper dealing with performance modelling techniques. Everyone writes tests for their software because they have proven to be useful, but why are so few people profiling their own code? Is it too difficult?

I have created a C# console project named FileWriter https://github.com/Alois-xx/FileWriter which writes 5000 100KB files (= 500MB) to a folder with different strategies. How do you decide which strategy is the most efficient and/or fastest one? As strategies we have

  • Single Threaded
  • Multi Threaded with TPL Tasks
  • Multi Threaded with Parallel.For
    • 2,3,4,5 Threads

The serial version is straightforward. It creates a n files and writes in each file a string until the file size is reached. The code is located at https://github.com/Alois-xx/FileWriter/blob/master/DataGenerator.cs, but the exact details are not important.

Lets measure how long it takes:

As expected the serial version is slowest and the parallel versions are faster. The fastest one is the one which uses 3 threads. Based on real world measurements we have found that 3 threads is the fastest solution. OK problem solved. Move on and implement the next feature. That was easy.

But, as you can imagine there is always a but: Are the numbers we have measured reasonable? Lets do some math. We write 500 MB with the parallel version in 11s. That gives a write speed of 45 MB/s. My mental performance model is that we should be able to max out a SATA SSD which can write 430 MB/s.

This is off nearly by a factor 10 what the SSD is able to deliver. Ups. Where did we loose performance? We can try out the Visual Studio CPU Profiler:

That shows that creating and writing to the file are the most expensive parts. But why? Drilling deeper will help to some extent, but we do not see where it did wait for disk writes because the CPU Profiler is only showing CPU consumption. We need another tool. Luckily there is the Concurrency Visualizer extension for free on the Visual Studio Marketplace, which can also show Thread blocking times.

After installation you can start your project from the Analyze menu with your current debugging settings.

We try to stay simple and start FileWriter with the single threaded strategy which is easiest to analyze

FileWriter -generate c:\temp serial

In the Utilization tab we get a nice summary of CPU consumption of the machine. In our case other processes were consuming a lot of CPU while we were running. White is free CPU, dark gray is System and light gray are all other processes. Green is our profiled process FileWriter.exe.

The first observation is that once we start writing files other processes become very busy which looks correlated. We come back to that later.

The Threads view shows where the Main Thread is blocked due to IO/Waiting/…

You can click with the mouse on the timeline of any thread to get in the lower view the corresponding stack trace. In this case we are delayed e.g. by 2,309 s by a call to WriteFile. For some reason the .NET 6.0 symbols could not be resolved but it might also be just a problem with my symbol lookup (there should be a full stack trace visible). Personally I find it difficult to work with this view, when other processes are consuming the majority of CPU. But since Concurrency Visualizer uses ETW there is nothing that can stop us to open the Concurrency Visualizer generated ETW files from the folder C:\Users\\Documents\Visual Studio 2022\ConcurrencyVisualizer\ with WPA.

We see that FileWriter (green) starts off good but then MsMpEng.exe (red = Defender Antivirus) is kicking in and probably delaying our File IO significantly. In the middle we see System (lila = Windows OS) doing strange things (Antivirus again?).

Do you remember that our SSD can write 430 MB/s? The Disk Usage view in WPA shows disk writes. Our process FileWriter writes 500 MB of data, but there are only 4 MB written in 53ms (Disk Service Time). Why are we not writing to the SSD? The answer is: The Operating System is collecting a bunch of writes and later flushes data to disk when there is time or the amount of written data exceeds some internal threshold. Based on actual data we need to revise our mental performance model:

We write to the OS write cache instead of the disk. We should be able to write GB/s! The actual disk writes are performed by the OS asynchronously.

Still we are at 45 MB/s. If the disk is not the problem, then we need to look where CPU is spent. Lets turn over to the gaps where the green line of FileWriter writing on one thread to disk vanishes. There seems to be some roadblock caused by the System process (Lila) which is the only active process during that time of meditation:

When we unfold the Checkpoint Volume stacktag which is part of my Overview Profile of ETWController we find that whenever NTFS is optimizing the Free Bitmap of the NTFS volume it is traversing a huge linked list in memory which blocks all File IO operations for the complete disk. Do you see the bluish parts in the graph which WPA shows where NTFS Checkpoint volume samples are found in the trace? Automatic highlighting of the current table selection in the graph is one of the best features of WPA. The root cause is that we delete the old files of the temp folder in FileWriter from a previous run which produces a lot of stale NTFS entries which, when many new files are created, at some point cause the OS to wreak havoc on file create/delete performance. I have seen dysfunctional machines which did have a high file churn (many files per day were created/deleted) where all file create/delete operations did last seconds instead of sub milliseconds. See a related post about that issue here: https://superuser.com/questions/1630206/fragmented-ntfs-hdd-slow-file-deletion-600ms-per-file. I did hit this issue several times when I cleaned up ETW folders which contain for each ETL file a large list of NGENPDB folders which the .ni.pdb files. At some point Explorer is stuck for many minutes to delete the last files. The next day and a defrag later (the NTFS tables still need defragmentation even for SSDs) things are fast again. With that data we need to revisit our mental performance model again:

Creating/Deleting large amount of files might become very slow depending on internal NTFS states which are in memory structures and manifest as high CPU consumption in the Kernel inside the NtfsCheckpointVolume method.

That view is already good, but we can do better if we record ETW data on our own and analyze it at scale.

How about a simple batch script that

  • Starts ETW Profiling
  • Executes your application
  • Retrieves the test execution time as return value from your application
  • Stops ETW Profiling
  • Write an ETW file name with the measured test duration and test case name
  • Extract ETW Data for further analysis

That is what the RunPerformanceTests.cmd script of the FileWriter project does. The tested application is FileWriter.exe which writes files with different strategies to a folder. When it exits it uses as exit code of the process the duration of the test case in ms. For more information refer to ProfileTest.cmd command line help.

FileWriter and Measurement Details

Read on if you are eager for details how the scripts generate ETW measurement data or go to the next headline.

RunPerformanceTests.cmd
call ProfileTest.cmd CreateSerial       FileWriter.exe -generate "%OutDir%" serial
call ProfileTest.cmd CreateParallel     FileWriter.exe -generate "%OutDir%" parallel
call ProfileTest.cmd CreateParallel2    FileWriter.exe -generate "%OutDir%" parallel -threads 2
...
call ProfileTest.cmd CreateTaskParallel FileWriter.exe -generate "%OutDir%" taskparallel

The most “complex” of ProfileTest.cmd is where several ETW profiles of the supplied MultiProfile.wprp are started. The actual start command in the ProfileTest.cmd script is

wpr -start MultiProfile!File -start MultiProfile.wprp!CSWitch -start MultiProfile.wprp!PMCLLC 

But with shell scripting with the correct escape characters it becomes this:

"!WPRLocation!" -start "!ScriptLocation!MultiProfile.wprp"^^!File -start "!ScriptLocation!MultiProfile.wprp"^^!CSwitch -start "!ScriptLocation!MultiProfile.wprp"^^!PMCLLC

This line starts WPR with File IO tracing, Context Switch tracing to see thread wait/ready times and some CPU counters for Last Level Cache misses. Disk and CPU sampling is already part of these profiles.

The PMCLLC profile can cause issues if you are using VMs like HyperV. To profile low level CPU features you need to uninstall all Windows OS HyperV Features to get Last Level Cache CPU tracing running.

You can check if you have access to your CPU counters via

C:\>wpr -pmcsources
Id  Name                        
--------------------------------
  0 Timer                       
  2 TotalIssues                 
  6 BranchInstructions          
 10 CacheMisses                 
 11 BranchMispredictions        
 19 TotalCycles                 
 25 UnhaltedCoreCycles          
 26 InstructionRetired          
 27 UnhaltedReferenceCycles     
 28 LLCReference                
 29 LLCMisses                   
 30 BranchInstructionRetired    
 31 BranchMispredictsRetired    
 32 LbrInserts                  
 33 InstructionsRetiredFixed    
 34 UnhaltedCoreCyclesFixed     
 35 UnhaltedReferenceCyclesFixed
 36 TimerFixed                  

If you get a list with just one entry named Timer then you will not get CPU counters, because you have installed one or several HyperV features. In theory this issue should not exist because this was fixed with Windows 10 RS 5 in 2019 (https://docs.microsoft.com/en-us/archive/blogs/vancem/perfview-hard-core-cpu-investigations-using-cpu-counters-on-windows-10), but I cannot get these CPU counters when HyperV is active in Windows 10 21H2.

The easiest fix is to remove the “!ScriptLocation!MultiProfile.wprp”^^!PMCLLC from the ProfileTest.cmd script, or you modify the MultiProfile.wprp to set in all HardwareCounter nodes

<HardwareCounter Id="HardwareCounters_EventCounters" Base="" Strict="true">

the value Strict=”false” to prevent WPR from complaining when it could not enable hardware counters. The MultiProfile.wprp profile uses nearly all features of WPR which are possible according to the not really documented xsd Schema to streamline ETW recording to your needs. The result of this “reverse” documenting is an annotated recording profile that can serve as base for your own recording profiles. Which profiles are inside it? You can query it from the command line with WPR:

D:\Source\FileWriter\bin\Release\net6.0>wpr -profiles MultiProfile.wprp

Microsoft Windows Performance Recorder Version 10.0.19041 (CoreSystem)
Copyright (c) 2019 Microsoft Corporation. All rights reserved.

   Default                   
               (CPU Samples/Disk/.NET Exceptions/Focus)
   CSwitch    +(CPU Samples/Disk/.NET Exceptions/Focus/Context Switch)
   MiniFilter +(CPU Samples/Disk/.NET Exceptions/Focus/MiniFilter)
   File       +(CPU Samples/Disk/.NET Exceptions/Focus/File IO)
   Network    +(CPU Samples/Disk/.NET Exceptions/Focus/Network)
   Sockets    +(CPU Samples/Disk/.NET Exceptions/Focus/Sockets)
   VirtualAlloc (Long Term)
   UserGDILeaks (Long Term)
   PMCSample  
     PMC Sampling for PMC Rollover + Default
   PMCBranch  
     PMC Cycles per Instruction and Branch data - Counting
   PMCLLC                      
     PMC Cycles per Instruction and LLC data - Counting
   LBR                         
     LBR - Last Branch Record Sampling

Or you can load the profile into WPRUI which is part of the Windows SDK where they will show up under the Custom Measurements node:

All profiles can be mixed (except Long Term) to record common things for unmanaged and .NET Code with the least amount of data. A good choice is the default profile which enables CPU sampling, some .NET events, Disk and Window Focus events. These settings are based on years of troubleshooting, and should be helpful recording settings also in your case. The supplied MS profiles are recording virtually everything which is often a problem that the trace sizes are huge (many GB) or, if memory recording is used, the history is one minute or even less on a busy system which has problems.

Generated Data

RunPerformanceTests.cmd saves the ETW data to the C:\Temp folder

After data generation it calls ExtractETW.cmd which is one call to ETWAnalyzer which loads several files in parallel and creates Json files from the ETW files

ETWAnalyzer -extract all -fd c:\temp\*Create*.etl -symserver MS -nooverwrite

This will generate JSON files in C:\Temp\Extract which can be queried with ETWAnalyzer without the need to load every ETL file into WPA.

As you can see the 500+ MB ETL files are reduced to ca. 10 MB which is a size reduction of a factor 50 while keeping the most interesting aspects. There are multiple json files per input file to enable fast loading. If you query with ETWAnalyzer e.g. just CPU data the other *Derived* files are not even loaded which keeps your queries fast.

Working With ETWAnalyzer and WPA on Measured Data

To get an overview if all executed tests suffer from a known issue you can query the data from the command line. To get the top CPU consumers in the Kernel not by method name but by stacktag you can use this query

EtwAnalyzer -dump CPU -ProcessName System -StackTags * -TopNMethods 2

I am sure you ask yourself: What is a stacktag? A stacktag is a descriptive name for key methods in a stacktrace of an ETW event. WPA and TraceProcessing can load a user configured xml file which assigns to a stacktrace a descriptive (stacktag) name, or Other if no matching stacktag rule could be found. The following XML fragment defines e.g. the above NTFS Checkpoint Volume stacktag:

  <Tag Name="NTFS Checkpoint Volume">
	<Entrypoint Module="Ntfs.sys" Method="NtfsCheckpointVolume*"/>
  </Tag>

See default.stacktags of ETWAnalyzer how it can be used to flag common issues like OS problems, AV Scanner Activity, Encryption Overhead, … with descriptive stacktags. If more than one stacktag matches the first matching stacktag, deepest in the stacktrace, wins. You can override this behavior if you set a priority value for a given tag.

Based on the 7 runs we find that the CheckpointVolume issue is part of every test which blocks File writes up to 6+ seconds. Other test runs do no suffer from this which makes comparison of your measured numbers questionable. Again we need to update our mental performance model:

Tests which are slower due to CheckpointVolume overhead should be excluded to make measurements comparable where this internal NTFS state delay is not occurring.

The exact number would be the sum of CPU+Wait, because this operation is single threaded we get exact numbers here. ETWAnalyzer can sum method/Stacktags CPU/Wait times with the -ShowTotal flag. If we add Method, the input lines are kept, if we use Process only process and file totals are shown.

EtwAnalyzer -dump CPU -ProcessName System -StackTags *checkpoint* -ShowTotal method

Now we know that file IO is delayed up to 7461 ms without the need to export the data to Excel. ETWAnalyzer allows you to export all data you see at the console to a CSV file. When you add -csv xxx.csv and you can aggregate/filter the data further as you need it. This is supported by all -Dump commands!

We can also add methods to the stacktags by adding -methods *checkpoint* to see where the stacktag has got its data from

EtwAnalyzer -dump CPU -ProcessName System -StackTags *checkpoint* -methods *checkpoint*
The method NtfsCheckpointVolume has 873 ms, but the stacktag has only 857 ms. This is because other stacktags can “steal” CPU sampling/wait data which is attributed to it. This is curse and feature rolled into one. It is great because if you add up all stacktags then you get the total CPU/Wait time for a given process across all threads. It is bad, because expensive things might be looking less expensive based on one stacktag because other stacktags have “stolen” some CPU/Wait time.

That is all nice, but what is our FileWriter process doing? Lets concentrate on the single threaded use case by filtering the files with -FileDir or -fd which is the shorthand notation for it to *serial*. You need to use * in the beginning, because the filter clause uses the full path name. This allows you to query with -recursive a complete directory tree where -fd *Serial*;!*WithAV* -recursive defines a file query starting from the current directory for all Serial files but it excludes folders or files which contain the name WithAV. To test which files are matching you can use -dump Testrun -fd xxxx -printfiles.

EtwAnalyzer -dump CPU -ProcessName FileWriter -stacktags * -Filedir *serial*

The test takes 12s but it spends 2,2s yet alone with Defender Activity. Other stacktags are not giving away a clear problem we could drill deeper. One interesting thing is that we spend 750ms in thread sleeps. A quick check in WPA shows

that these are coming from the .NET Runtime to recompile busy methods later on the fly. Stacktags are great to flag past issues and categorize activity, but they also tend to decompose expensive operations into several smaller ones. E.g. if you write to a file you might see WriteFile and Defender tags splitting the overall costs of a call to WriteFile. But what stands out that on my 8 Core machine a single threaded FileWriter fully utilizes my machine with Defender activity. We need to update our mental performance model again:

Writing to many small files on a single thread will be slowed down by the Virus Scanner a lot. It also generates a high load on the system which slows down everything else.

Since we did write the application we can go down to specific methods to see where the time is spent. To get an overview we use the top level view of WPA of our CreateFile method

In WPA we find that

  • CreateFiles
    • CPU (CPU Usage) 10653 ms
    • Wait Sum 672 ms
    • Ready Time Sum 1354 ms
      • This is the time a thread did wait for a CPU to become free when all CPUs were used

which adds up to 12,679 s which pretty closely (16ms off) matches the measured 12,695 s. It is always good to verify what you did measure. As you have seen before we have already quite often needed to change our mental performance model. The top CPU consumers in the CreateFiles method are

  • System.IO.StreamWriter::WriteLine
  • System.IO.Strategies.OSFileStreamStrategy::.ctor
  • System.IO.StreamWriter::Dispose
  • System.Runtime.CompilerServices.DefaultInterpolatedStringHandler::ToStringAndClear
  • System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[System.Char]::Rent
  • System.Runtime.CompilerServices.DefaultInterpolatedStringHandler::AppendFormatted
  • System.IO.Strategies.OSFileStreamStrategy::.ctor
  • System.IO.StreamWriter::WriteLine

What is surprising is that ca. 2s of our 12s are spent in creating interpolated strings. The innocent line

 string line = $"echo This is line {lineCount++}";

costs ca. 16% of the overall performance. Not huge, but still significant. Since we are trying to find out which file writing strategy is fastest this is OK because the overhead will be the same for all test cases.

We can get the same view as in WPA for our key methods with the following ETWAnalyzer query

EtwAnalyzer -dump CPU -ProcessName FileWriter -fd *serial* -methods *StreamWriter.WriteLine*;*OSFileStreamStrategy*ctor*;*StreamWriter.WriteLine*;*DefaultInterpolatedStringHandler*AppendFormatted*;*TlsOverPerCoreLockedStacksArrayPool*Rent;*DefaultInterpolatedStringHandler.ToStringAndClear;*StreamWriter.Dispose;*filewriter* -fld

You might have noticed that the method names in the list have :: between the class and method. ETWAnalyzer uses always . which is less typing and is in line with old .NET Framework code which used :: for JITed code and . for precompiled code.

Additionally I have smuggled -fld to the query. It is the shorthand for -FirstLastDuration or -fld which shows the difference between the Last-First time the method did show up in CPU Sampling or Context switch data. Because we know that our test did only measure DataGenerator.CreateFile calls we see in the Last-First column 12,695 s which is down to the ms our measured duration! This option can be of great value if you want to measure parallel asynchronous activity wall clock time. At least for .NET code the generated state machine classes of the C# compiler contain central methods which are invoked during init and finalization of asynchronous activities which makes it easy to “measure” the total duration even if you have no additional trace points at hand. You can add to -fld s s or -fld local local or -fld utc utc to also show the first and last time in

  • Trace Time (seconds)
  • Local Time (customer time in the time zone the machine was running)
  • UTC Time

See command line help for further options. ETWAnalyzer has an advanced notation of time which can be formatted in the way you need it.

The WPA times for Wait/Ready are the same in ETWAnalyzer, but not CPU. The reason is that to exactly judge the method CPU consumption you need to look in WPA into CPU Usage (Sampled) which shows per method CPU data, which is more accurate. ETWAnalyzer merges this automatically for you.

CPU usage (Precise) Trap

The view CPU usage (Precise) in WPA shows Context switch data which is whenever your application called a blocking OS call like Read/WriteFile/WaitForSingleObject, … your thread is switched off the CPU which is called context switch. You will see in this graph therefore only (simplification!) methods which were calling a blocking OS method. If you have e.g. a busy for loop in method int Calculate() and the next called method is Console.WriteLine, like this,

int result = Calculate();
Console.WriteLine(result)

to print the result then you will see in CPU usage (Precise) all CPU attributed to Console.WriteLine because that method was causing the next blocking OS call. All other non blocking methods called before Console.WriteLine are invisible in this view. To get per method data you need to use the CPU data of CPU sampling which gets data in a different way to much better measure CPU consumption at method level. Is Context Switch data wrong? No. The per thread CPU consumption is exact because this is a trace of the OS Scheduler when he moves threads on/off a CPU. But you need to be careful to not interpret the method level CPU data as true CPU consumption of that method.

CPU Usage (Sampled) Trap

Lets compare both CPU views (Sampled) vs (Precise)

StreamWriter.Dispose method consumes 887 ms in Sampling while in Context Switch view we get 1253 ms. In this case the difference is 41%! With regards to CPU consumption you should use the Sampled view for methods to get the most meaningful value.

Every WPA table can be configured to add a Count column. Many people fell into the trap to interpret the CPU Sampling Count column as number of method calls, while it actually just counts the number of sampling events. CPU sampling works by stopping all running threads 1000 (default) times per second, take a full stacktrace and write that data to ETW buffers. If method A() did show up 1000 times then it gets 1000*1ms CPU attributed. CPU sampling has statistical bias and can ignore e.g. method C() if it is executing just before, or after, the CPU sampling event did take a stack trace.

ETWAnalyzer Wait/Ready Times

The method list shown by ETWAnalyzer -Dump CPU -methods xxx is a summation across all threads in a process. This serves two purposes

  • Readability
    • If each thread would be printed the result would become unreadable
  • File Size
    • The resulting extracted Json file would become huge

This works well for single threaded workloads, but what about 100 threads waiting for something? WPA sums all threads together if you remove the thread grouping. ETWAnalyzer has chosen a different route. It only sums all non overlapping Wait/Ready times. This means if more than one thread is waiting, the wait is counted only once.

The Wait/Ready times for methods for multi threaded methods are the sum of all non overlapping Wait/Ready times for all threads. If e.g. a method was blocked on many threads you will see as maximum wait time your ETW recording time and not some arbitrary high number which no one can reason about. It is still complicated to judge parallel applications because if you have heavy thread over subscription you will see both, Wait and Ready, times to reach your ETW recording time. This is a clear indication that you have too many threads and you should use less of them.

Make Measurements Better

We have found at least two fatal flaws in our test

  • Anti Virus Scanner is consuming all CPU even in single threaded test
    • Add an exclude rule to the folder via PowerShell
      • Add-MpPreference -ExclusionPath “C:\temp\Test”
    • To remove use
      • Remove-MpPreference -ExclusionPath “C:\temp\Test”
  • Some test runs suffer from large delays due to CheckpointVolume calls which introduces seconds of latency which is not caused by our application
    • Change Script to delete temp files before test is started to give OS a chance to get into a well defined file system state

When we do this and we repeat the tests a few times we get for the single threaded case:

Nearly a factor 3 faster for changing not a single line of code is pretty good. The very high numbers are coming from overloading the system when something else is running in the background. The 12s numbers are for enabled Defender while < 10s are with a Defender exclusion rule but various NtfsCheckpoint delays. When we align everything correctly then we end up with 7s for the single threaded case which is 71 MB/s of write rate. The best value is now 3,1s for 4 threads with 161 MB/s.

What else can we find? When looking at variations we find also another influencing factor during file creation:

We have a single threaded test run with 9556 ms and a faster one with 7066 ms. Besides the CheckpointVolume issue we loose another 1,3 s due to reading MFT and paged out data. The faster test case (Trace #2) reads just 0,3 ms from the SSD, while we read in the slower Run (Trace #1) 1,3s from disk. The amount of data is small (8 MB), but for some reason the SSD was handing out the data slowly. Yes SSDs are fast, but sometimes, when they are busy with internal reorg things, they can become slow. SSD performance is a topic for another blog post.

We need to update our mental performance model

File System Cache state differences can lead to wait times due to Hard faults. Check method MiIssueHardFault if the tests have comparable numbers.

Etwanalyzer -dump CPU -pn Filewriter -methods MiIssueHardFault
That are another 2 s of difference which we can attribute to File System cache state and paged out data.

Measuring CPU Efficiency

We know that 4 threads are fastest after removing random system noise from our measured data. But is is it also the most efficient one? We can dump the CPUs Performance Monitoring Counters (PMC) to console with

ETWAnalyzer -dump PMC -pn FileWriter

The most fundamental metric is CPI which is Cycles/Instructions. If you execute the the same use case again you should execute the same numbers of instructions. When you go for multithreading you need to take locks and coordinate threads which costs you extra instructions. The most energy efficient solution is therefore the single threaded solution, although it is slower.

To get an overview you can export data into a CSV file from the current directory and the initial measurements we did start with. The following query will dump the Performance Monitoring Counters (PMC) to a CSV file. You can use multiple -fd queries to select the data you need:

ETWAnalyzer -dump PMC -pn FileWriter -fd ..\Take_Test1 -fd .  -csv PMC_Summary.csv

The CSV file contains the columns Test Time in ms and CPI to make it easy to correlate the measured numbers with something else e.g. CPI in this case. This pattern is followed for all -Dump commands of ETWAnalyzer. From this data we can plot Test Timing vs CPI this is Cycles per Instruction into a Pivot Table:

The initial test where we had

  • Defender
  • CheckpointVolume

issues was not only much slower, but CPU efficiency wise much worse: 1,2 vs 0,69 (smaller is better). This proves that even if Defender would be highly efficient on all other cores and not block our code at all we still would do much worse (73%), because even when we do exactly the same, CPU caches and other internal CPU resources, are slowing us down. With this data you can prove why you are slower even when everything else is the same.

Conclusions

If you have read that far I am impressed. That is a dense article with a lot of information. I hope I have convinced you that profiling your system with ETW and not only your code for strategic test cases is an important asset in automated regression testing.

Credits: NASA, ESA, CSA, and STScI (https://www.nasa.gov/image-feature/goddard/2022/nasa-s-webb-delivers-deepest-infrared-image-of-universe-yet)

If you look only at your code you will miss the universe of things that are moving around in the operating system and other processes. You will end up with fluctuating tests which no one can explain. We are doing the same thing 10 times and we still get outliers! The key is not only to understand your own code, but also the system in which it is executing. Both need attention and coordination to achieve the best performance.

To enable that I have written ETWAnalyzer to help you to speed up your own ETW analysis. If you go for ETW you are talking about a lot of data which needs a fair amount of automation to become manageable. Loading hundreds of tests into WPA is not feasible, but with ETWAnalyzer mass data extraction and analysis never was easier. The list of supported queries is growing. If you miss something file an issue so we can take a look what the community is needing.

To come back to the original question. Is profiling too difficult? If you are using only UI based tools then the overhead to do this frequently it is definitely too much. But with the scripting approach shown by FileWriter to record data and extract it with ETWAnalyzer, to make the data queryable, is giving you a lot of opportunities. ETWAnalyzer comes with an object model for the extracted Json files. You can query for known issues directly and flag past issues within seconds without the need to load the profiling data into a graphical viewer to get an overview.