When Things Get Really Bad – NTFS File System Corruption War Story

The setup: a PC running software version 1.0. After an upgrade to 2.0 everything worked, but after one to two weeks the machine would no longer start the v2.0 software. When the issue was analyzed it turned out that an XML configuration file started with binary data, which prevented v2.0 from starting up.

Good file

<?xml version="1.0" encoding="utf-8"?>
  <import resource="../../Config1.xml"/>    
  <import resource="../../Config2.xml"/>
  <import resource="../../Config3.xml"/>
  <import resource="../../Config4.xml"/>

Bad file

DCMPA30€×ÕÞ±# EÔ°ƒ3¡;k”€
0…\
*d*“B  ...
    <control name="MissileControl" ref="IMissileControl"/>
    <control name="PositionControl" ref="IPositionControl"/>
    <control name="ReservationNotify" ref="IReservationNotify"/>
</config>

The binary data covered the first 4096 bytes of the XML file, which was certainly not good. This raises the question whether other files were corrupted as well and whether we can trust that machine at all. That calls for a tool that scans the entire hard drive and calculates a checksum for every file, so the result can be compared against a known good machine with an identical setup. This is not a new problem, so there should be some open source software out there. I found hashdeep, which was developed by Jesse Kornblum, an agent of a US government agency. It was exactly what I needed. You can recursively generate a list of files with their hashes from a good machine and then, in audit mode, compare the files of another machine against that list and print the missing/new/changed files as diff output.

The audit file generation and file verification are straightforward:

hashdeep -r dir  > /tmp/auditfile            # Generate the audit file
hashdeep -a -k /tmp/auditfile -r dir         # Audit dir against the known hashes
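If the audit fails, adding the verbose switch (once or several times) should make hashdeep report more detail about which files are new, missing or changed (flags taken from the option list below):

hashdeep -a -vv -k /tmp/auditfile -r dir     # Audit with more verbose mismatch reporting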

The current version seems to be 4.4 and can be downloaded from GitHub. If you are curious, here are the supported options:

C:\> hashdeep64.exe [OPTION]... [FILES]...
-c <alg1,[alg2]> - Compute hashes only. Defaults are MD5 and SHA-256
                   legal values: md5,sha1,sha256,tiger,whirlpool,
-p <size> - piecewise mode. Files are broken into blocks for hashing
-r        - recursive mode. All subdirectories are traversed
-d        - output in DFXML (Digital Forensics XML)
-k <file> - add a file of known hashes
-a        - audit mode. Validates FILES against known hashes. Requires -k
-m        - matching mode. Requires -k
-x        - negative matching mode. Requires -k
-w        - in -m mode, displays which known file was matched
-M and -X act like -m and -x, but display hashes of matching files
-e        - compute estimated time remaining for each file
-s        - silent mode. Suppress all error messages
-b        - prints only the bare name of files; all path information is omitted
-l        - print relative paths for filenames
-i/-I     - only process files smaller than the given threshold
-o        - only process certain types of files. See README/manpage
-v        - verbose mode. Use again to be more verbose
-d        - output in DFXML; -W FILE - write to FILE.
-j <num>  - use num threads (default 8)

After checking all files I found no other corruptions. That is really strange. Why did the XML file not become corrupt during the upgrade but two weeks later? The config file is read only and should never be written by anybody. It could be a simple glitch in some memory bank of the SSD, but since the exact same error happened on several machines we can rule out cosmic rays as the source of our problem. But can radiation be a problem at all? Intel has tried it out by putting an SSD into the beam of a particle accelerator. When enough radiation is involved anything can happen, but it will happen at random locations.

Could it be a hard power off while the system was writing data to the disk? Even if the OS does everything right, the SSD might not have enough power left to persist the data. But that theory could also be ruled out, because the SSD of that machine had a built-in capacitor to ensure that all pending writes are safely written even during a hard power off event.

Once you have thought of enough nice theories you need to get more evidence (read: data). In this case checking the Windows Event Logs could help to find some clues. A first hint was that after the machine was bricked, the Application event log showed that Defrag had been run. Why on earth is Defrag running on an SSD? That should not be needed, since there is no need to realign data to optimize disk seek times. It turned out that Defrag had been re-enabled by accident during the update to v2.0.

[Screenshot: Defrag scheduled task settings]

The default setting is that it runs every day late in the evening. Windows is smart enough not to defragment an SSD. Instead it triggers SSD Trim to let the SSD do its own reorganization. But that is not all. Although Defrag is not performing a typical defrag operation, it still defragments directories and other NTFS tables. NTFS becomes dead slow when you store around 100K files in a single directory, and at some point you hit hard limits of the NTFS file system structures. These internal lists need some cleanup when many file add/remove operations happen in hot directories; otherwise the file system can become slow or broken over time. To prevent that, Defrag still does some housekeeping and moves things around even on SSDs.
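If you want to check whether the scheduled Defrag task is enabled on one of your machines, you can query it from an (elevated) command prompt. This assumes the default task name that current Windows versions use:

schtasks /Query /TN "\Microsoft\Windows\Defrag\ScheduledDefrag" /V /FO LIST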

Since not all machines run until 9 PM, this also explains why the machines broke seemingly at random within two weeks after the software update. Once Defrag had run and the machine had been shut down, the next morning when the machine was turned on again it would refuse to work due to the corrupted XML configuration file.

That is a nice finding, but it still does not explain why Defrag would ruin our XML file. Let's check the System event log for interesting events. By interesting I mean messages like these:

System Event Log Messages

Id   | Source                         | Meaning                | Example Message
6005 | EventLog                       | System Start           | The Event log service was started.
6006 | EventLog                       | System Stop            | The Event log service was stopped.
6008 | EventLog                       | Hard Reboot            | The previous system shutdown at xxxx AM on dd/dd/dddd was unexpected.
41   | Microsoft-Windows-Kernel-Power | Hard Reboot            | The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
109  | Microsoft-Windows-Kernel-Power | Reboot initiated by OS | The kernel power manager has initiated a shutdown transition.
131  | Ntfs                           | NTFS Corruption        | The file system structure on volume C: cannot be corrected.
130  | Ntfs                           | NTFS Repair (chkdsk)   | The file system structure on volume C: has now been repaired.

Application Event Log Messages

Id    | Source                   | Meaning       | Example Message
258   | Microsoft-Windows-Defrag | Defrag Run    | The storage optimizer successfully completed defragmentation on System (C:)
258   | Microsoft-Windows-Defrag | Defrag Run    | The storage optimizer successfully completed retrim on System (C:)
26228 | Chkdsk                   | chkdsk Repair | Chkdsk was executed in verify mode on a volume snapshot.

 

The next clue was that after Defrag had been run, the NTFS volume needed a repair.

Chkdsk        26228        None        "Chkdsk was executed in verify mode on a volume snapshot.

Checking file system on \Device\HarddiskVolume2

Volume label is System.

Examining 1 corruption record ...

Record 1 of 1: Bad index "$I30" in directory "\Users\xxxx\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\System Tools <0x4,0x15437>" ... The multi-sector header signature for VCN 0x0 of index $I30 in file 0x15437 is incorrect.

44 43 4d 01 50 41 33 30 ?? ?? ?? ?? ?? ?? ?? ?? DCM.PA30........

corruption found.

That is a helpful message! After Defrag has run, NTFS is broken. The NTFS corruption is then fixed by chkdsk, which complains about some Start Menu entry. Luckily chkdsk also prints the first bytes of the corrupted file, which start with DCM.PA30, the same bytes as in our corrupted XML file! I extracted the binary contents of the broken XML file into a separate file. Then I ran hashdeep.exe over it to check whether this binary data is the content of some other file. Since I only had the first 4K of the data I was not hopeful to find anything. But lucky me, I did find a matching file. It looks like we mixed the contents of

c:\Windows\WinSxS\Manifests\amd64_windows-defender-management-onecore_…_none_….manifest

with our XML file and the Start Menu entry shortcut! The question remains what a Windows Defender manifest has to do with our XML file and the Start Menu entry.
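hashdeep did the heavy lifting here, but the idea can also be sketched in a few lines of C#: hash the first 4096 bytes of every file under a directory and compare it against the hash of the extracted binary block. The paths and the 4 KiB block size below are only illustrative assumptions, not what actually ran on the machine:

using System;
using System.IO;
using System.Security.Cryptography;

class FindFirstBlockMatch
{
    // Hash only the first 4096 bytes of a file. Files smaller than one block are skipped.
    static string HashFirst4K(string file)
    {
        using FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
        if (fs.Length < 4096)
        {
            return null;
        }

        byte[] buffer = new byte[4096];
        int read = 0;
        while (read < buffer.Length)
        {
            read += fs.Read(buffer, read, buffer.Length - read);
        }

        using MD5 md5 = MD5.Create();
        return Convert.ToHexString(md5.ComputeHash(buffer));
    }

    static void Main()
    {
        // Hash of the 4 KiB block that was extracted from the corrupted xml file (illustrative path)
        string wanted = HashFirst4K(@"C:\temp\CorruptedBlock.bin");

        var options = new EnumerationOptions { RecurseSubdirectories = true, IgnoreInaccessible = true };
        foreach (string file in Directory.EnumerateFiles(@"C:\Windows", "*", options))
        {
            try
            {
                if (HashFirst4K(file) == wanted)
                {
                    Console.WriteLine($"First 4 KiB match: {file}");
                }
            }
            catch (IOException)
            {
                // locked or vanished files are ignored
            }
            catch (UnauthorizedAccessException)
            {
                // no read permission - ignore
            }
        }
    }
}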

Let's check the System event log for interesting events:

[Screenshot: System event log entries]

Now we can replay the events that led to the NTFS corruption:

  1. The OS is installed.
  2. Device Guard is enabled (not shown in the logs).
  3. Windows shuts itself down ("The kernel power manager has initiated a shutdown transition").
  4. Sometimes the system hangs at this point and the person installing the machine presses the power button to perform a hard reset.
    • This caused the initial NTFS corruption.
  5. After boot NTFS detects that something is corrupt, which is then repaired, although one message tells us that it cannot be corrected:
    1. Volume C … needs to be taken offline to perform a Full Chkdsk.
    2. A corruption was discovered in the file system structure on volume.
    3. The file system structure on volume C: cannot be corrected.
    4. The file system structure on volume C: has now been repaired.
  6. Now comes our part. After the automatic repair, software v1.0 is installed, writes its data onto an ill-repaired NTFS volume and continues to run happily until …
  7. v2.0 of the software is installed, which re-enables Defrag by accident.
  8. Software version 2.0 runs happily, it seems …
  9. But on a machine that is still running at 9:00 PM, Defrag executes and defragments the NTFS tables. Because it uses referential information from other (corrupted) tables, it destroys the first 4K of our XML file, and we end up with the contents of a Windows Defender file in it.
  10. The next morning the machine refuses to work.
  11. Problems …

That was an interesting issue, and as you can guess this analysis was not done by me alone; several people contributed to connect all the loose ends. Thank you! You know who you are.

What can we learn from this?

  • Log messages are really important
  • Keeping all of them is essential to find the root cause
    • Overwriting the event log with myriads of messages is not the brightest idea
  • Luck: We got lucky that the System Event Log history included the initial OS installation

Hard Power Off

This is another case where transactions do not help to get an always consistent file system. What else can happen? Let's try it out by writing to some log files and then powering off the machine while it is writing data to disk.

I used this simple C# program to write to several files in an alternating manner. I am not sure if that is needed, but it seems to help to provoke issues sooner:

using System.Collections.Generic;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        List<Stream> streams = new List<Stream>();
        const int Files = 5;

        // Open 5 test files which are kept open and written to in an alternating manner
        for (int i = 0; i < Files; i++)
        {
            string fileName = $"C:\\temp\\TestCorruption{i}.txt";
            FileStream stream = new FileStream(fileName, FileMode.OpenOrCreate, FileAccess.Write, FileShare.ReadWrite);
            streams.Add(stream);
        }

        while (true)
        {
            for (int i = 0; i < Files; i++)
            {
                using StreamWriter writer = new StreamWriter(streams[i], leaveOpen: true);
                writer.WriteLine("Hello this is string 1");
                writer.Flush();   // force a WriteFile call so the OS has the data
                writer.Write("Hello this is string 2");
                writer.Flush();   // force another WriteFile call
            }
        }
    }
}

We write the string "Hello this is string 1" to five log files, flush the data to force a WriteFile call to the OS so that our process has no pending buffered data, then write another string and force WriteFile again. If the process crashes we still get consistent data in our log files, because the OS already has the data and will write it to disk some time later. But what will happen when we do a hard power off?

Let's look at one of our log files:

Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this isNULNULNULNULNULNULNULNULNULNUL....

Who wrote the 0 bytes to our text file? We wrote only ASCII data to our files! When WriteFile("Buffered data") is executed and the OS wants to write that data to disk, the OS needs to

  1. Increase the file size in the directory (metadata)
  2. Write the buffered data over the freshly nulled cluster

Steps 1 and 2 are related, but a hard power off can let step 1 happen without step 2. This can result in a text file that contains larger amounts of 0 bytes at the end for no apparent reason. See https://stackoverflow.com/questions/10913632/streamwriter-writing-nul-characters-at-the-end-of-the-file-when-system-shutdown where someone with more NTFS know-how than me answers that question:

That happens. When you append a file first its size is corrected in the directory (and that’s transactional in NTFS) and then the actual new data is written. There’s good chance that if you shut down the system you end up with a file appended with lots of null bytes because data writes are not transactional unlike metadata (file size) writes.

There’s no absolute solution to this problem.

(answered by sharptooth on Stack Overflow, Jun 6 '12)

When a cluster is allocated, it is zeroed out to ensure that no previous data can leak through. While you append data to a file you can lose the last write even though the file size already contains the updated value, which makes the 0 bytes visible in a text editor. That is a nasty issue if you expected that nobody would ever hard reset a machine. I am not aware of a way around it, except that you need to be prepared that some text files may contain 0 bytes at the end after a hard power off.
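If your own log reader has to cope with such files, one defensive option (just a sketch, not a general solution) is to ignore trailing 0 bytes when reading the file:

using System;
using System.IO;
using System.Text;

class NulTolerantReader
{
    // Read a text file but cut off any trailing NUL bytes that a hard power off may have left behind
    static string ReadTextIgnoringTrailingNuls(string fileName)
    {
        byte[] bytes = File.ReadAllBytes(fileName);

        int end = bytes.Length;
        while (end > 0 && bytes[end - 1] == 0)
        {
            end--;
        }

        return Encoding.UTF8.GetString(bytes, 0, end);
    }

    static void Main()
    {
        // Hypothetical log file written by the test program above
        Console.WriteLine(ReadTextIgnoringTrailingNuls(@"C:\temp\TestCorruption0.txt"));
    }
}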

Was the file system corrupted again this time? What does the System event log tell us?

Critical    9/5/2020 5:40:52 PM    Microsoft-Windows-Kernel-Power    41    (63)    The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Information    9/5/2020 5:40:52 PM    Microsoft-Windows-Ntfs    98    None    Volume \\?\Volume{12a882c8-5384-11e3-8255-806e6f6e6963} (\Device\HarddiskVolume2) is healthy.  No action is needed.

NTFS says everything is green. From a file integrity perspective everything worked until power was cut. Your hard disk missed the last write, but hey: who guaranteed you that this is an atomic operation?
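If losing the last writes during a power cut is not acceptable for your data, you can ask the OS to flush its file cache down to the device with FileStream.Flush(true), which calls FlushFileBuffers under the hood. Here is a minimal sketch (the file name is made up, and write-through flushes are expensive, so you would not want to do this for every log line):

using System.IO;

class WriteThroughDemo
{
    static void Main()
    {
        using FileStream fs = new FileStream(@"C:\temp\Durable.txt", FileMode.Create, FileAccess.Write);
        using StreamWriter writer = new StreamWriter(fs);

        writer.WriteLine("This line should survive a power cut");
        writer.Flush();   // hand the buffered data to the OS (WriteFile)
        fs.Flush(true);   // FlushFileBuffers: ask the OS to write the data through to the disk
    }
}

Even then the drive itself has to honor the flush, which is where the power loss protection capacitor mentioned earlier comes into play.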

Conclusions

Even impossible errors have an explanation if you are willing to dig deep enough. What was the nastiest error you managed to track down to its root cause?

One thought on “When Things Get Really Bad – NTFS File System Corruption War Story”

  1. I ran into the same problem a few years ago. Since then, I try to write files atomically, instead of streaming contents continuously. I don’t know what happened to NTFS, but my recollection is, it was more robust in the past.
