The latest Batman PC game earned a bad reputation after its first release for being buggy and very slow. Due to some bad outsourcing and project management decisions the PC version did not meet the required quality gates. This quality and performance disaster forced Warner Bros to pull the PC version from sale for three months until a much improved version was released which met customer expectations with regard to performance. The outcome was that the unpatched PC version was sold at 50% off and I could get a very good game for a very low price. I bought the unpatched version, threw away the unpatched DVDs, and downloaded the “real” v1.0 from Steam with the accompanying product code.
Being aware of the potentially bad performance of the game I kept an eye on it, but I did not find any rough spots anymore. The only thing that bugged me was that when I switched away from the game to do something different and then switched back after ca. 30 minutes, the game graphics would hang for several seconds. That sounded like a good reason to check performance with my favorite performance recording tool, ETWController. Since it also records screenshots I could easily track down in the profiling data what was happening while the game was hanging.
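Under the hood ETWController records an ETW kernel session with the relevant providers enabled. For the curious, here is a minimal C sketch of what such a session looks like at the API level. This is not ETWController's actual code, just the bare Win32 calls with disk IO and hard fault events switched on; the ETL path is made up and the program must run elevated (link against Advapi32):

#define INITGUID            // get a definition of SystemTraceControlGuid in this file
#include <windows.h>
#include <wmistr.h>
#include <evntrace.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    // EVENT_TRACE_PROPERTIES must be followed by room for the logger and log file names.
    char buffer[sizeof(EVENT_TRACE_PROPERTIES) + 2 * MAX_PATH] = { 0 };
    EVENT_TRACE_PROPERTIES *props = (EVENT_TRACE_PROPERTIES *)buffer;

    props->Wnode.BufferSize    = sizeof(buffer);
    props->Wnode.Flags         = WNODE_FLAG_TRACED_GUID;
    props->Wnode.Guid          = SystemTraceControlGuid;     // the "NT Kernel Logger" session
    props->Wnode.ClientContext = 1;                          // QPC timestamps
    props->LogFileMode         = EVENT_TRACE_FILE_MODE_SEQUENTIAL;
    props->EnableFlags         = EVENT_TRACE_FLAG_PROCESS |
                                 EVENT_TRACE_FLAG_THREAD  |
                                 EVENT_TRACE_FLAG_DISK_IO |
                                 EVENT_TRACE_FLAG_MEMORY_HARD_FAULTS;
    props->LoggerNameOffset    = sizeof(EVENT_TRACE_PROPERTIES);
    props->LogFileNameOffset   = sizeof(EVENT_TRACE_PROPERTIES) + MAX_PATH;
    strcpy(buffer + props->LogFileNameOffset, "C:\\temp\\hang.etl");

    TRACEHANDLE session = 0;
    ULONG err = StartTraceA(&session, "NT Kernel Logger", props);  // KERNEL_LOGGER_NAME
    printf("StartTrace returned %lu\n", err);

    // Stop it later with ControlTraceA(session, "NT Kernel Logger", props, EVENT_TRACE_CONTROL_STOP).
    return (int)err;
}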
Since I also get an HTML report I can see exactly where in the timeline I continued the game, although I do not get screenshots of the in-game graphics.
After stopping the recording I can directly open the ETL file and navigate to the point of slowness in the game by selecting the time region after Screenshot 5.
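WPA does the heavy lifting of decoding the ETL data. If you ever want to crunch the events yourself, the same file can be read with the native ETW consumer API (OpenTrace/ProcessTrace). A rough sketch which only counts the events is shown below; a real analysis would inspect the provider id and payload of each event record, and the file name is just the made-up one from the sketch above:

#include <windows.h>
#include <wmistr.h>
#include <evntrace.h>
#include <evntcons.h>
#include <stdio.h>

static ULONGLONG g_eventCount = 0;

// Called once for every event in the ETL file.
static void WINAPI OnEvent(PEVENT_RECORD record)
{
    g_eventCount++;
    // record->EventHeader.ProviderId, ProcessId and TimeStamp are available here.
}

int main(void)
{
    EVENT_TRACE_LOGFILEA logfile = { 0 };
    logfile.LogFileName         = (LPSTR)"C:\\temp\\hang.etl";
    logfile.ProcessTraceMode    = PROCESS_TRACE_MODE_EVENT_RECORD;
    logfile.EventRecordCallback = OnEvent;

    TRACEHANDLE handle = OpenTraceA(&logfile);
    if (handle == INVALID_PROCESSTRACE_HANDLE)
    {
        printf("OpenTrace failed: %lu\n", GetLastError());
        return 1;
    }

    ProcessTrace(&handle, 1, NULL, NULL);   // pumps every event through OnEvent
    CloseTrace(handle);
    printf("%llu events in trace\n", g_eventCount);
    return 0;
}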
When a game hangs the culprit is usually CPU, disk, memory or the network. So let's check what is going on.
Looking at the CPU consumption we see that the game has several seconds of near-zero CPU activity between 10-13s, which correlates with the observed hang.
While checking disk IO during the low CPU activity phase I find a lot of page file related IO. When the game resumes it needs to access a lot of its memory which was previously written out to the page file, and this results in a growth of the game's working set.
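You can watch this effect even without a trace: the working set and the cumulative page fault count of a process are exposed by GetProcessMemoryInfo. Below is a small sketch; PrintMemoryCounters is just a hypothetical helper, the process id is passed on the command line, and keep in mind that PageFaultCount includes soft faults as well, only hard faults actually hit the disk:

#include <windows.h>
#include <psapi.h>      // link with psapi.lib on older toolsets
#include <stdio.h>
#include <stdlib.h>

// Print the working set and page fault counter of one process.
void PrintMemoryCounters(DWORD pid)
{
    HANDLE process = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, FALSE, pid);
    if (process == NULL)
    {
        printf("OpenProcess failed: %lu\n", GetLastError());
        return;
    }

    PROCESS_MEMORY_COUNTERS counters = { sizeof(counters) };
    if (GetProcessMemoryInfo(process, &counters, sizeof(counters)))
    {
        printf("Working set: %llu MB, page faults so far: %lu\n",
               (unsigned long long)(counters.WorkingSetSize / (1024 * 1024)),
               counters.PageFaultCount);
    }
    CloseHandle(process);
}

int main(int argc, char **argv)
{
    if (argc > 1)
        PrintMemoryCounters((DWORD)atoi(argv[1]));   // e.g. the game's process id from Task Manager
    return 0;
}

Sampling this once per second while switching back to the game should show the working set and the fault counter jumping up together, which is the page file readback visible in the trace.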
In the disk IO graph above I see that I have two page files (on drives C and D), where I wait 3.6s for 55 MB from drive D and 500ms for 103 MB from drive C. That makes sense since my C drive is an SSD
whereas my D drive is a regular hard drive.
That explains why reading from the page file on the D drive is 7 times slower although only about half as much data is read from it! Random read access is the worst case scenario for a regular hard disk, whereas my SSD copes with that read pattern easily.
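If you want to convince yourself of that difference, a crude micro benchmark is enough: read 4 KB blocks at random, sector aligned offsets with the file system cache switched off and compare both drives. The sketch below does just that; the test file names are placeholders for a large file you create on each drive beforehand:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

// Average time of random 4 KB reads from an existing large test file, in ms per read.
double MeasureRandomReads(const char *fileName, int reads)
{
    // FILE_FLAG_NO_BUFFERING bypasses the file system cache so we measure the device itself.
    HANDLE file = CreateFileA(fileName, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return -1.0;

    LARGE_INTEGER size, freq, start, end;
    GetFileSizeEx(file, &size);
    QueryPerformanceFrequency(&freq);

    LONGLONG blocks = size.QuadPart / 4096;
    void *buffer = _aligned_malloc(4096, 4096);   // unbuffered IO needs sector aligned buffers

    QueryPerformanceCounter(&start);
    for (int i = 0; i < reads; i++)
    {
        LARGE_INTEGER offset;
        offset.QuadPart = (((LONGLONG)rand() << 15 | rand()) % blocks) * 4096;  // aligned random offset
        SetFilePointerEx(file, offset, NULL, FILE_BEGIN);
        DWORD bytesRead = 0;
        ReadFile(file, buffer, 4096, &bytesRead, NULL);
    }
    QueryPerformanceCounter(&end);

    _aligned_free(buffer);
    CloseHandle(file);
    return (end.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart / reads;
}

int main(void)
{
    // Hypothetical test files of a few GB, one per drive.
    printf("C: %.2f ms/read\n", MeasureRandomReads("C:\\temp\\testfile.bin", 1000));
    printf("D: %.2f ms/read\n", MeasureRandomReads("D:\\temp\\testfile.bin", 1000));
    return 0;
}

A spinning disk pays a full seek of typically several milliseconds for every such read, whereas an SSD serves it in a fraction of a millisecond, which is exactly the asymmetry the page file readback ran into.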
I do not even remember enabling a page file on D, but I think I did it some years ago when SSDs still had write durability issues, just to be on the safe side. As it turns out that was a bad decision, and as an easy fix I disabled the page file on drive D. The sudden hangs when switching back to the game were gone after that change.
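For the record: the configured page files live in the registry under the Memory Management key, so it is easy to check which drives host one. The change itself is easiest through the Virtual Memory dialog in the System Properties; the sketch below only reads the setting:

#include <windows.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char data[1024];
    DWORD size = sizeof(data);
    // PagingFiles is a REG_MULTI_SZ value with one entry per configured page file.
    LONG err = RegGetValueA(HKEY_LOCAL_MACHINE,
                            "SYSTEM\\CurrentControlSet\\Control\\Session Manager\\Memory Management",
                            "PagingFiles", RRF_RT_REG_MULTI_SZ, NULL, data, &size);
    if (err != ERROR_SUCCESS)
    {
        printf("RegGetValue failed: %ld\n", err);
        return 1;
    }

    // Walk the multi-string: entries are \0 separated, the list ends with an empty string.
    for (char *entry = data; *entry != '\0'; entry += strlen(entry) + 1)
    {
        printf("Page file: %s\n", entry);   // e.g. "c:\pagefile.sys 0 0" for a system managed size
    }
    return 0;
}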
Whose Fault Is It?
That was the easy part. But you can go much deeper to really see where the latencies are coming from. Hard faults which involve disk IO are served in the kernel by the method ntoskrnl.exe!MiIssueHardFault, which waits for the pending disk IO of the page file. If we look at which processes waited for page fault IOs and visualize the total CPU and wait times we get this timeline:
- We can see that the hard faults nicely correlate with the Disk activity in the graph below.
- We find that the Arkham Knight game waits only 90ms for hard page faults (see the Waits(us)_Sum column).
- But the System process waits over 10s (summed across all threads) for page-in activity.
If we look at who is causing this page-in activity we find that DirectX in the Windows kernel (dxgmms2.sys) causes quite some IO due to page faults:
ntoskrnl.exe!MiIssueHardFault
ntoskrnl.exe!MmAccessFault
|- ntoskrnl.exe!KiPageFault
|- ntoskrnl.exe!MiProbeLeafFrame
| ntoskrnl.exe!MmProbeAndLockPages
| |- dxgmms2.sys!VIDMM_MDL_RANGE::Lock
| | dxgmms2.sys!VIDMM_RECYCLE_HEAP_PHYSICAL_VIEW::LockRange
| | dxgmms2.sys!VIDMM_RECYCLE_RANGE::Lock
| | dxgmms2.sys!VIDMM_RECYCLE_MULTIRANGE::Lock
| | dxgmms2.sys!VIDMM_RECYCLE_HEAP_MGR::ProbeAndLockAllocation
| | dxgmms2.sys!VIDMM_GLOBAL::ProbeAndLockAllocation
| | |- dxgmms2.sys!VIDMM_SYSMEM_SEGMENT::LockAllocationRange
| | | |- dxgmms2.sys!VIDMM_MEMORY_SEGMENT::CommitResource
| | | | dxgmms2.sys!VIDMM_GLOBAL::PageInOneAllocation
| | | | dxgmms2.sys!VIDMM_GLOBAL::ProcessDeferredCommand
| | | | dxgmms2.sys!VIDMM_WORKER_THREAD::Run
| | | | ntoskrnl.exe!PspSystemThreadStartup
| | | | ntoskrnl.exe!KiStartSystemThread
It is not even the game directly: DirectX needs to page its buffers back in from previously paged-out memory, which takes several seconds from my slow spinning hard disk. It is possible to track this further down into the wait calls of the game, but that should be proof enough to blame the disk IO of the DirectX buffers for the observed sluggish gaming experience.
Conclusions
It was not Batman’s fault that he was slow, but my misconfigured secondary page file on my spinning hard disk. Removing the page file from the spinning disk solved my performance problem.
Another observation is that Windows seems to write page data out to the first responding device which hosts a page file. That is nice to get as much data as possible into the page file in an efficient manner, but the real test comes when the data is read back again. I think the current page-out algorithm in Windows is inefficient when it can choose from several devices to write its page data to. It should balance the written data with respect to the random access service time of each device. In effect, SSDs with their very fast response times should be favored over slow hard disks. This would still make it possible to have several page files on devices with vastly different response times, but Windows should prefer to write most of the data onto the device with the fastest random access time to keep hard page fault pause times as short as possible.
This once again shows that assumptions about bad performance based on former evidence (the game is slow …) will lead to wrong conclusions about current performance issues.
Performance problems are much like quantum mechanics: the problem does not exist until you have measured it. Some people come up with smart tests to verify this or that assumption, but it is usually easier to use a profiling tool and get down to the root cause with a more structured approach.