War Story: How Antivirus solutions can bring a server down

From a performance point of view, antivirus scanners are among the worst things that have happened to computers in the last five years or so. Sometimes it is hard to tell the CPU activity of an AV scanner apart from that of a Bitcoin miner: both eat your CPU to give you either virtual security or virtual money.

Whenever something is mysterious and slow I get a call. In this case a beefy Windows server was working happily, but a few times a day everything would freeze for up to 5 minutes. The workaround to get back up to speed was to reboot the server whenever it got stuck. While stuck, the server sat at 100% CPU for several minutes. That sounds like an obvious CPU problem: identify the problematic process and uninstall the application causing the trouble.

To identify the issue, ETW traces were taken. Everything was working fine until the ETW profiling events stopped in the recording. Several minutes later the CPU problem started. Was someone tinkering with the ETW traces? Shortly before the trace stopped I saw an installer for the Palo Alto Cortex AV solution executing. That was interesting. Normally I record data with xperf, which uses the NT Kernel Logger ETW session. Xperf also supports another kernel session, the Circular Kernel Context Logger, which has been present since at least Windows 7. When running ETW tracing for a longer time I record with xperf into a file which is stopped every few minutes; profiling is then restarted (no merging of the files is done). That way I can later pick the interesting files out of the many when the issue is reported again. Although I default to a 10 ms sample rate, you still need up to 40 GB of disk space to record a full day.

The Issue occurred again

Below is a picture of total CPU consumption over a period of two minutes. Green is idle (the CPU is free) and red is the CPU consumed by the Windows Sleep API call, which eats ca. 60% of all cores for more than one minute. Why does sleeping consume so much CPU?

There is a specific workload which shows up only on many-core servers: spinlocks. A spinlock burns some CPU on all cores (trading CPU for decreased latency) before the spinning threads fall back to the expensive OS lock. The usual implementation of a spinlock uses several stages:

  1. Check if the lock can be acquired
  2. If not
    • For(i..n)
      • Burn some CPU cycles by calling the pause instruction in a second loop
      • Check if the lock can be acquired and exit the loop
      • Call Sleep(0) to give another thread a chance to do some useful work
      • Check if the lock can be acquired and exit the loop
    • Enter the OS lock
  3. Lock is acquired
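
The staged acquire scheme above can be sketched in a few lines of C++. This is a simplified illustration, not a production lock: Sleep(0) is approximated with std::this_thread::yield(), the pause instruction with a no-op delay loop, and the final fallback to an expensive OS lock is left out.

```cpp
#include <atomic>
#include <thread>

// Simplified spinlock following the staged scheme above.
class SpinLock
{
    std::atomic<bool> locked{false};

    bool TryLock()
    {
        // Cheap read first avoids hammering the cache line with interlocked writes.
        return !locked.load(std::memory_order_relaxed) &&
               !locked.exchange(true, std::memory_order_acquire);
    }

public:
    void Lock()
    {
        while (true)
        {
            for (int i = 0; i < 100; i++)          // spinning stage: For(i..n)
            {
                if (TryLock()) return;             // check if lock can be acquired

                for (volatile int p = 0; p < 64; p++)
                {
                    // burn some CPU cycles; real locks execute the pause instruction here
                }
                if (TryLock()) return;

                std::this_thread::yield();         // stands in for Sleep(0)
                if (TryLock()) return;
            }
            // A production lock would now enter the expensive OS lock;
            // this sketch just keeps yielding until the lock is free.
            std::this_thread::yield();
        }
    }

    void Unlock() { locked.store(false, std::memory_order_release); }
};

// Demo: two threads increment a shared counter under the lock.
int CountWithSpinLock(int perThread)
{
    SpinLock lck;
    int counter = 0;
    auto worker = [&]
    {
        for (int i = 0; i < perThread; i++)
        {
            lck.Lock();
            counter++;
            lck.Unlock();
        }
    };
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    return counter;
}
```

The point to keep in mind for later: while a thread is waiting, Sleep(0) (here yield) is called in a tight loop on every spinning core.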

Some high-performance threading libraries, such as the Intel OpenMP library, spin for up to 200 ms so that they almost never need to take an OS lock, buying a few microseconds of faster algorithm execution at insane power consumption cost. This means that if you let one empty task run every 200 ms, you get this CPU graph when using the Intel OpenMP library:

Sure, your CPU looks super busy, but did you perform much useful work? Not at all. The Intel OpenMP library makes it look like you are doing heavy number crunching and you need a bigger processor to take that load.

If you set the environment variable KMP_BLOCKTIME=0 you get a flat CPU profile, as it always should have been. Intel's decision to waste up to 200 ms of CPU cycles by default is highly questionable, and no one who has profiled the CPU consumption would leave it at the default value.

What has this to do with virus scanners? They have a habit of intercepting system calls in creative ways: OS callbacks, application detouring and ETW. E.g. Windows 11 23H2 has instrumented many rarely used system calls with ETW trace points to make it easier to detect unusual application activity.

Interlude – SysCall Tracing

What can you do if you want to see every operating system call performed by an application? On Linux you can run the application under the strace tool to get a nice console output of all invoked operating system calls. On Windows no comparable tool exists. But there is a kernel ETW provider to enable SYSCALL tracing which not many people are aware of.

In a wpr profile you enable SysCall tracing by adding the SystemCall keyword, and if you want you can even get the stack trace for each call (warning: high data volume!). The logged events alone are not that useful, because they are missing the process and thread which issued the call. To make these events useful you also need to record Context Switch events, which record how long a thread was running on a given CPU.
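
The attribution step is essentially an interval lookup: for every SysCall event you look up which thread the context switch data says was running on that CPU at that timestamp. Below is a hypothetical sketch; the record layouts are invented for illustration and are not the real ETW event formats.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical simplified records; real ETW events carry more fields.
struct CSwitchInterval   // a thread ran on a CPU from Start until End
{
    uint64_t Start;
    uint64_t End;
    uint32_t ThreadId;
};

struct SysCallEvent      // a syscall logged on a CPU at a point in time
{
    uint64_t Timestamp;
    uint32_t Cpu;
};

// Per-CPU context switch history; intervals sorted by Start time.
using CSwitchHistory = std::map<uint32_t, std::vector<CSwitchInterval>>;

// Find which thread issued a syscall: the thread that was running on
// that CPU when the syscall event was logged. Returns 0 if unknown.
uint32_t AttributeSysCall(const CSwitchHistory& history, const SysCallEvent& ev)
{
    auto cpuIt = history.find(ev.Cpu);
    if (cpuIt == history.end()) return 0;

    const std::vector<CSwitchInterval>& intervals = cpuIt->second;
    // Binary search for the last interval starting at or before the event.
    auto it = std::upper_bound(intervals.begin(), intervals.end(), ev.Timestamp,
        [](uint64_t ts, const CSwitchInterval& iv) { return ts < iv.Start; });
    if (it == intervals.begin()) return 0;
    --it;
    return (ev.Timestamp < it->End) ? it->ThreadId : 0;
}
```

WPA and the TraceProcessing library do this join for you; the sketch only shows why the SystemCall events are useless without the CSwitch events.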

Below is a sample snippet of a typical .wprp recording profile for wpr to enable SysCall tracing:

...
<Keywords>
   <Keyword Value="SystemCall"/>
   <Keyword Value="CSwitch"/>
</Keywords>
<Stacks>
  <Stack Value="SystemCallEnter"/>
</Stacks>

A full SysCall recording profile is located in the FileWriter project. To use it, download MultiProfile.wprp with curl, which is part of Windows. Then you can use it like this:

curl https://raw.githubusercontent.com/Alois-xx/FileWriter/master/MultiProfile.wprp > MultiProfile.wprp
wpr -start MultiProfile.wprp!SysCall
wpr -stop c:\temp\Wpr_SysCall.etl

Or if you prefer xperf:

xperf -on base+syscall+CSWITCH -stackwalk SyscallEnter 
...
xperf -stop 
xperf -merge c:\kernel.etl SyscallTracing.etl

The merge step is needed to get method names; otherwise you just get addresses in the resulting trace. You can now open the trace with WPA (Windows Performance Analyzer). Under System Activity – Syscalls a new graph pops up when you load such a trace, which allows you to analyze call duration, frequency, return values and other things:

Unlike strace on Linux you do not get the input arguments, but you at least get the return code, which is better than nothing.

We can also dump the data with the help of the TraceProcessing library in a few lines:

using Microsoft.Windows.EventTracing;
using Microsoft.Windows.EventTracing.Symbols;
using Microsoft.Windows.EventTracing.Syscalls;
using System.Diagnostics;
namespace strace
{
    internal class Program
    {
        static void Main(string[] args)
        {
            var processor = TraceProcessor.Create(args[0], new TraceProcessorSettings
            {
                AllowLostEvents = true,
            });
            string symServer = "*http://msdl.microsoft.com/download/symbols";
            if (args.Length > 1 && args[1] == "-nosymserver")
            {
                symServer = ""; // Loading MS symbols makes loading the trace much slower. Once loaded with symbols you can omit them
            }
            
            var symbols = processor.UseSymbols();
            var syscalls = processor.UseSyscalls();
            var sw = Stopwatch.StartNew();
            Console.WriteLine("Loading trace");
            processor.Process();
            Console.WriteLine($"Trace loaded in {sw.Elapsed.TotalSeconds:F0} s");
            sw = Stopwatch.StartNew();
            Console.WriteLine("Loading symbols");
            SymbolPath symPath = new($@"SRV*C:\Symbols{symServer}");
            SymCachePath symCache = new(@"C:\SymCache");
            symbols.Result.LoadSymbolsAsync(symCache, symPath).Wait();
            Console.WriteLine($"Symbols loaded in {sw.Elapsed.TotalSeconds:F0} s");
            foreach (ISyscall call in syscalls.Result.Syscalls)
            {
                if (call.ReturnValue == null) // syscall enter event; only exit events carry the return value
                {
                    continue;
                }
                string? funcName = call.StackFrame.Symbol?.FunctionName;
                string? image = call.Process?.ImageName;
                if (funcName != null && image != null)
                {
                    Console.WriteLine($"{image}({call.Process.Id}): {funcName} <- 0x{call.ReturnValue:X}");
                }
            }
        }
    }
}

Our (admittedly much inferior) Windows strace tool can generate this output for e.g. all failed calls (return code != 0x0). Here our own strace process crashed because it tried to access a non-existing file:

C>strace c:\syscallCSwitch.etl | findstr /i strace | findstr /v 0x0 
strace.exe(17016): NtOpenThreadTokenEx <- 0xC000007C
strace.exe(17016): NtOpenThreadTokenEx <- 0xC000007C
strace.exe(17016): NtQueryValueKey <- 0xC0000034
strace.exe(17016): NtOpenKey <- 0xC0000034
strace.exe(17016): NtQuerySecurityAttributesToken <- 0xC0000225
strace.exe(17016): NtQuerySecurityAttributesToken <- 0xC0000225
strace.exe(17016): NtQueryValueKey <- 0xC0000034
strace.exe(17016): NtOpenThreadTokenEx <- 0xC000007C
strace.exe(17016): NtOpenThreadTokenEx <- 0xC000007C
strace.exe(17016): NtSetInformationVirtualMemory <- 0xC00000BB
strace.exe(17016): NtQueryValueKey <- 0xC0000034
strace.exe(17016): NtOpenSemaphore <- 0xC0000034
strace.exe(17016): NtQueryValueKey <- 0xC0000034
strace.exe(17016): NtSetInformationVirtualMemory <- 0xC00000BB

The output is interesting, but without the input arguments we cannot make much progress: we cannot tell whether a failed call was even trying to access our file, was given the wrong file, or was doing something entirely different.
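
The NTSTATUS values in the listing are at least well-known codes. A tiny lookup table already makes the output much more readable; the symbolic names below are from the public ntstatus.h of the Windows SDK and cover only the codes seen above.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Symbolic names for the NTSTATUS codes seen in the strace output above
// (values as defined in ntstatus.h of the Windows SDK).
std::string NtStatusToString(uint32_t status)
{
    static const std::map<uint32_t, std::string> names =
    {
        { 0xC000007C, "STATUS_NO_TOKEN" },               // thread has no impersonation token
        { 0xC0000034, "STATUS_OBJECT_NAME_NOT_FOUND" },  // e.g. registry key/value or file does not exist
        { 0xC0000225, "STATUS_NOT_FOUND" },
        { 0xC00000BB, "STATUS_NOT_SUPPORTED" },
    };
    auto it = names.find(status);
    return it != names.end() ? it->second : "UNKNOWN";
}
```

E.g. the repeated STATUS_OBJECT_NAME_NOT_FOUND returns from NtOpenKey/NtQueryValueKey are failed registry lookups.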

Antivirus Problem

Let's get back to our problem of high CPU in a Sleep call. When we drill deeper into the Sleep CPU consumption we find in the stack trace (the numbers are CPU ms)

KernelBase.dll!SleepEx 2 522 541 
|- ntdll.dll!NtDelayExecution 2 510 323.140645
 |    |- ntoskrnl.exe!KiSystemServiceExitPico 2 489 111
 |    |    |- ntoskrnl.exe!PerfInfoLogSysCallEntry 1 387 533
 |    |    |    |- ntoskrnl.exe!EtwTraceKernelEvent 1 386 492
 |    |    |    |    |- ntoskrnl.exe!EtwpLogKernelEvent 1 384 792
 |    |    |    |    |    |- tedrdrv.sys!<PDB not found> 1 251 146
 |    |    |    |    |    |    |- ntoskrnl.exe!ExfReleaseRundownProtection 435 011
 |    |    |    |    |    |    |- ntoskrnl.exe!ExfAcquireRundownProtection 431 336
 |    |    |    |    |    |    |- tedrdrv.sys!<PDB not found> 343 058
 |    |    |    |    |    |    |- tedrdrv.sys!<PDB not found><itself> 41 649
 |    |    |- ntoskrnl.exe!PerfInfoLogSysCallExit 1 061 081
 |    |    |    |- ntoskrnl.exe!EtwTraceKernelEvent 1 059 181
 |    |    |    |    |- ntoskrnl.exe!EtwpLogKernelEvent 1 056 491
 |    |    |    |    |    |- tedrdrv.sys!<PDB not found> 916 549

calls to PerfInfoLogSysCallEntry and PerfInfoLogSysCallExit. If we run the numbers, the CPU consumption of SleepEx itself is not our problem: due to ETW tracing, the cost of SleepEx is more than 33 times higher than it would normally be!

But that is still not the full story. Inside the ETW logger we find a call to tedrdrv.sys, for which we have no symbols. The vast majority of the cost comes from this driver, which calls into ExfAcquire/ReleaseRundownProtection, part of an API that was added so that antivirus drivers can safely update themselves at runtime.

With ETWAnalyzer we can dump the module information for all DLLs/drivers without symbols (most likely non-Microsoft binaries) that show up in expensive stack traces.

EtwAnalyzer %f% -Dump CPU -TopN 5 -ShowTotal File -ProcessName !*idle* -SortBy CPU -TopNMethods 2 -MinMaxCPUms 100000 -Methods *.dll;*.sys -ShowModuleInfo    

ETWAnalyzer has an internal list of nearly all antivirus drivers. If a driver from that well-known list is present in the output, it shows up as yellow text. With -ShowModuleInfo (-smi) we find that tedrdrv.sys is indeed the Palo Alto driver which has intercepted the ETW traces and causes the high CPU consumption. The plot is thickening, but we still do not have a smoking gun. When profiling is stopped, my script always captures the output of xperf -Loggers to see which ETW loggers were running shortly before our session was stopped, to check that profiling was working properly. The output for the NT Kernel Logger is the smoking gun:

	Logger Name           : NT Kernel Logger
	Logger Id             : ffff
	Logger Thread Id      : 0000000000003DC4
	Buffer Size           : 8192
	Maximum Buffers       : 118
	Minimum Buffers       : 96
	Number of Buffers     : 118
	Free Buffers          : 70
	Buffers Written       : 224156
	Events Lost           : 0
	Log Buffers Lost      : 0
	Real Time Buffers Lost: 0
	Flush Timer           : 0
	Age Limit             : 0
	Log File Mode         : Secure PersistOnHybridShutdown SystemLogger
	Maximum File Size     : 0
	Log Filename          : PaloNull
	Trace Flags           : SYSCALL
	PoolTagFilter         : *

Remember that the initial ETW traces were fine until the profiling events stopped? Now we have found the reason. Palo Alto enables SysCall tracing by taking over the NT Kernel Logger ETW session, which was already in use by us. Not nice.

Now we have all the data to understand the high CPU issue.

When multithreaded code using spinlocks runs, Sleep(0) is called on all cores in a tight loop; that is how spinlocks work, as explained earlier. Since Palo Alto wants to know which kernel methods all applications are calling, they enable ETW tracing and hook the ETW trace call inside the kernel to get easy access to that information. Unfortunately the AV driver needs to access some protected kernel object via a "lock" with ExfAcquire/ReleaseRundownProtection. This "lock" is implemented as interlocked memory accesses, which are fine as long as not many threads are hitting the same memory. That assumption no longer holds in a heavily multithreaded application. The CPU cost of Sleep(0) is now >33 times higher, which drives the server to 100% CPU for unusually long times (minutes) and makes users want to reboot the machine. The server is not responding, but it miraculously recovers if you just wait patiently long enough.
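
The scaling collapse of such an interlocked "lock" can be reproduced in isolation. The sketch below contains nothing ETW-specific; it merely lets many threads do interlocked increments on one shared counter, the same access pattern the rundown protection count sees on every traced syscall, so on a many-core box the cache line ping-pongs between all cores.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// All threads perform interlocked increments on ONE shared counter, similar
// to how every traced syscall bumps the rundown protection count in the AV
// driver. The shared cache line bounces between cores, so the per-operation
// cost grows with the number of concurrently hammering threads.
long long HammerSharedCounter(int threads, int incrementsPerThread)
{
    std::atomic<long long> shared{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; t++)
    {
        workers.emplace_back([&]
        {
            for (int i = 0; i < incrementsPerThread; i++)
            {
                shared.fetch_add(1, std::memory_order_seq_cst); // interlocked add
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return shared.load();
}
```

Timing HammerSharedCounter with 1 vs. 64 threads on a many-core machine shows the per-increment cost exploding, which hints at why the SleepEx overhead only becomes crippling on big servers.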

Conclusions

Antivirus solutions, multithreaded software and many-core machines are a deadly mix for performance. The idea of intercepting SysCalls might work decently on small machines, but when AV vendors try the same trick on server machines running multithreaded software, things fall apart. I hope some AV vendor reads this post and changes their mind about the SysCall tracing approach. Since consumer machines get more and more cores, this problem will only get bigger over the next years. Please stop intercepting SysCalls and let the machine do its work.

In the past, AV got in the way only when files were accessed or processes were started. That can be dealt with by configuring the AV solution to exclude certain directories/processes to limit its impact. But if compute-bound workloads are affected in general, what should one do except uninstall that AV software? I have had bad experiences with configuring AV solutions, because very often the overhead is still there although the configuration UI says that everything is turned off.

You can figure out the impact of AV with ETW, but it takes quite some time and a lot of data to get to that point. For the average user it is impossible to find such issues, unless they can do some A/B testing with/without the AV solution. That works, but if the next AV update brings in new tracing capabilities, you will have a hard time figuring that out without ETW. It should not be so hard to quantify the impact of AV solutions, given the number of times they cause issues.

The loss of productivity due to the high performance overhead of AV solutions is hard to quantify, but it is definitely not cheap. In my opinion, all AV scanning solutions try hard to identify every threat, but fail to properly address the adverse performance effects caused by these additional scanning capabilities.

On the other hand, all virus “vendors” will test against all currently used AV scanning solutions to get in undetected. In effect, AV scanning solutions will mostly be able to detect past threats, but not the latest release of e.g. malware X. I wonder if it is cost effective to detect past threats with such a high overhead while new malware by definition is still able to get in.

If the scanning settings of AV solutions were tuned for performance and not only for detection rate, our fleet of more than one billion Windows computers would consume on the order of a gigawatt less electrical energy (conservative estimate: >1 W of additional power × 10^9 machines). This is a global issue which is not on anyone’s agenda, because it is so well hidden deep inside your computer, which still works but consumes more CPU than necessary. Many people talk about the negative environmental effects of Bitcoin mining, but no one talks about the environmental effects of AV scanning overhead.

I hope I have solved the mystery of the stuck server and raised new questions about whether it is really ok to turn on every AV scanning feature to stay safe. The tradeoff between computer safety and power consumption should be discussed more by security experts, to arrive at decisions which are more balanced than is currently the case. A wild idea would be a virtual watt meter for every wasted watt consumed by AV scanners, to make this cost explicit. You can only act on things you can measure, and this is currently not the case.

4 thoughts on “War Story: How Antivirus solutions can bring a server down”

  1. Hmm… are you saying that Palo Alto is actually hooking EtwpLogKernelEvent? ETW consumers are notified async and in the stack traces above it very much seems that Palo Alto’s driver is called in the context of Sleep. I am surprised patch guard does not punish them for this.

    I sure hope that you escalated the issue to them, with all the global warming, we really do not need servers to be running unnecessary hot.

    FWIW, I understand why they are doing what they are doing – most likely are after VirtualProtect and friends.

      • well, you have it installed on a server, thus you must have a license, thus there must be some support contract in place. Just escalate it as an issue, apparently it is making your server unusable. I do not think this qualifies as sec vulnerability.
