How a chkdsk Can Turn into an UNMOUNTABLE_BOOT_VOLUME Bugcheck

It started as a routine health check. I was managing a Windows Server Core domain controller in Azure with over 1 million files on the volume - a busy environment handling authentication for our entire domain. The server had been showing some file system inconsistencies, and I was concerned about potential MFT (Master File Table) corruption.

Like any administrator, I decided to run a file system check. What could go wrong with a simple chkdsk command, right?

Famous last words.

The Fatal Command

Here's the command that changed my day from routine maintenance to disaster recovery:

chkdsk c: /f

Simple. Clean. Destructive.

The /f flag forces the volume to be checked and fixes any errors found. On a running system, Windows asks if you want to schedule the check for the next restart since it can't lock the system volume while Windows is running.
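From memory, the exchange on the system volume looks roughly like this (wording varies slightly between Windows versions):

C:\> chkdsk c: /f
The type of the file system is NTFS.
Cannot lock current drive.
Chkdsk cannot run because the volume is in use by another process.
Would you like to schedule this volume to be checked the next time
the system restarts? (Y/N)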

I answered "Y" for yes, thinking I was being proactive about system maintenance.

The First Warning Signs

After the reboot, I noticed something immediately wrong. Instead of the usual boot process, the server was taking an unusually long time to come online. When I finally got console access, I was greeted with filesystem errors and "handle invalid" messages.

The server had booted, services appeared to be running, but something was fundamentally broken. I couldn't delete files, couldn't run most commands, and kept getting this cryptic error:

Error: out of disk space

This was baffling: the volume showed 60 GB of free space, yet the system refused to copy even a 300 KB file, returning a "no disk space" error. Simultaneously, DFS Replication (DFSR) was broken, SYSVOL was stuck in initial synchronization, and file deletion was impossible. The DFSR event log reported:

The DFS Replication service failed to recover from an internal database error on volume C:.
Error: 9214 (Internal database error (-1086))
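For reference, these events live in the "DFS Replication" channel; a query like the following pulls the most recent entries (the count of 5 is arbitrary):

wevtutil qe "DFS Replication" /c:5 /rd:true /f:text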

This suggested potential MFT corruption or NTFS metadata exhaustion due to an excessively high number of files. The following steps detail how I investigated and validated the issue.

Checking Disk Space and File Count

To verify the disk space:

Get-PSDrive
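Get-PSDrive reports used and free space per drive. For exact byte counts straight from NTFS, fsutil offers a quick cross-check:

fsutil volume diskfree c: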

Although free space was available, I checked the number of files on the volume:

(Get-ChildItem -Recurse -Path C:\ -Force -ErrorAction SilentlyContinue |
    Measure-Object).Count

This command revealed over 1 million files, significantly above normal expectations for a domain controller.
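Since MFT pressure was the leading suspect, it's also worth reading the MFT's size directly. In the ntfsinfo output, the "Mft Valid Data Length" field (shown in hex) is the one to watch:

# Look for "Mft Valid Data Length" in the output
fsutil fsinfo ntfsinfo c: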

Confirming File System-Level Failures

Although I had administrative rights, even elevated PowerShell sessions couldn’t delete files. To rule out permissions, I attempted to run as SYSTEM:

psexec -s cmd.exe

Still, deletions failed, pointing to an underlying NTFS metadata issue. This is consistent with Master File Table exhaustion or corruption of transactional NTFS metadata.
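Roughly what those failures looked like from the SYSTEM shell (the path here is illustrative):

C:\> echo probe > C:\Windows\Temp\probe.txt
There is not enough space on the disk.

C:\> del C:\Windows\Temp\probe.txt
The handle is invalid.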

Scheduling a Filesystem Check

To address the issue, I scheduled a disk check with repair on next reboot:

chkdsk C: /F

After confirming the volume was in use, I accepted the prompt to run at reboot with Y and restarted the server.

With over 1 million files on the volume, several critical things happened:

1. Active Directory Database Vulnerability

The ntds.dit file (Active Directory database) was actively being written to when chkdsk started its repair process. Domain controllers constantly write to disk for:

  • Authentication requests
  • Replication traffic
  • Transaction logs
  • Registry updates

2. MFT Corruption During Repair

The Master File Table was likely corrupted during the chkdsk repair process. With 1M+ files, the MFT verification and repair phases can take hours, during which the filesystem is in an inconsistent state.

3. Transaction Log Corruption

Active Directory transaction logs (*.log files) are critical for database consistency. If these were corrupted or truncated during file system repair, it could render the entire AD database unusable.
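On a default installation, the database and its transaction logs live under C:\Windows\NTDS (ntds.dit plus the edb*.log and edb.chk files), so a quick listing shows exactly what was at risk:

dir C:\Windows\NTDS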

The File System Lockdown

After the initial chkdsk completed and the server rebooted, I discovered that chkdsk was still running in the background. Here's how I confirmed this:

# Check for running chkdsk processes
tasklist | findstr chkdsk

# Check system event log for ongoing chkdsk activity  
wevtutil qe System /c:10 /rd:true /f:text | findstr -i chkdsk

# Check NTFS volume information
fsutil fsinfo ntfsinfo c:

The NTFS info showed the volume structure was intact, but the "handle invalid" errors indicated that chkdsk had exclusive locks on the file system. The domain controller was essentially frozen - services were running but couldn't properly access files.
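In hindsight, chkntfs is another quick check worth knowing here; it reports whether a volume is dirty or already scheduled for a check at the next boot:

chkntfs c: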

The Point of No Return

After the failed "chkdsk C: /f" pass, I marked the volume as dirty. This forces an automatic disk check at the next reboot, when the system can get exclusive access to the volume and address any remaining file system issues during startup.

fsutil dirty set c:
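A quick query confirms the bit is set; the output is along the lines of:

fsutil dirty query c:
Volume - c: is Dirty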

Then I rebooted, hoping an offline chkdsk would resolve the corruption.

Instead, I was greeted with the dreaded UNMOUNTABLE_BOOT_VOLUME blue screen (stop code 0x000000ED). The offline chkdsk had likely hit corruption left behind by the initial online repair that it couldn't fix, rendering the boot volume unmountable.

The Aftermath: System Failure

The blue screen told the whole story. The chkdsk process had found critical file system corruption that it couldn't repair. With over 1 million files and complex MFT structures, the repair process had likely:

  • Damaged critical boot files
  • Corrupted the Active Directory database beyond repair
  • Destroyed essential system registry hives
  • Marked critical sectors as bad when they weren't recoverable

The domain controller was completely dead.

No Disk Failure: Let's Identify File and Folder Growth

To locate directories responsible for file bloat, I used these PowerShell commands:

Top 20 largest files:

Get-ChildItem -Path C:\ -Recurse -File -ErrorAction SilentlyContinue |
    Sort-Object Length -Descending |
    Select-Object -First 20 FullName,
        @{Name="SizeMB";Expression={[math]::Round($_.Length/1MB,2)}}

Largest folders by total file size:

Get-ChildItem C:\ -Directory | ForEach-Object {
    $folder = $_.FullName
    $size = (Get-ChildItem -Path $folder -Recurse -File -ErrorAction SilentlyContinue |
        Measure-Object -Property Length -Sum).Sum
    [PSCustomObject]@{
        Folder = $folder
        SizeMB = [math]::Round($size / 1MB, 2)
    }
} | Sort-Object SizeMB -Descending

Folders over 1 GB, with file counts:

Get-ChildItem C:\ -Directory | ForEach-Object {
    $folder = $_.FullName
    $files = Get-ChildItem -Path $folder -Recurse -File -ErrorAction SilentlyContinue
    $size = ($files | Measure-Object -Property Length -Sum).Sum
    $count = $files.Count
    if ($size -gt 1GB) {
        [PSCustomObject]@{
            Folder = $folder
            SizeGB = [math]::Round($size / 1GB, 2)
            FileCount = $count
        }
    }
} | Sort-Object SizeGB -Descending

These scans identified directories with excessive file counts, particularly application logs and DFSR staging folders.

Lessons Learned 

1. Be Extremely Cautious with chkdsk on Production Domain Controllers

While chkdsk isn't absolutely forbidden on domain controllers, it requires extreme caution. Microsoft's official documentation shows that chkdsk can be run, but the risks are significant when Active Directory databases and logs are involved. The safer approach is offline maintenance during planned downtime.

2. Understand Your File Volume

With 1M+ files, any file system operation becomes exponentially more complex and risky. The MFT size was over 1GB - corruption at this scale is often unrecoverable.

3. Always Have Multiple Domain Controllers

Thankfully, this wasn't our only domain controller. The domain continued functioning while I dealt with this disaster.

4. Backup Before Maintenance

I should have taken a VM snapshot before running any file system repairs. In Azure, this is as simple as snapshotting the OS disk or creating a VM restore point.
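For example, here is a minimal sketch using the Az PowerShell module; the resource group, disk, and snapshot names are hypothetical:

# Snapshot the DC's OS disk before any file system repair (names are hypothetical)
$disk = Get-AzDisk -ResourceGroupName "rg-dc" -DiskName "dc01_OsDisk"
$cfg  = New-AzSnapshotConfig -SourceUri $disk.Id -Location $disk.Location -CreateOption Copy
New-AzSnapshot -ResourceGroupName "rg-dc" -SnapshotName "dc01-osdisk-pre-chkdsk" -Snapshot $cfg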

What Would I Do Differently?

If I had to do this again, here's what I would do instead:

1. Check Current Status First

# Check if volume is dirty
fsutil dirty query c:

# Check filesystem without repairs (read-only)
chkdsk c: /scan

2. Plan for Offline Maintenance

Based on Microsoft's documentation, the safest approach for domain controllers is:

  • Schedule planned maintenance windows
  • Stop the NTDS service: net stop ntds
  • Run chkdsk with the DC offline (see the sketch after this list)
  • Restart NTDS service: net start ntds
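Put together, a maintenance window might look roughly like this sketch (note that chkdsk against the system volume still defers to a boot-time check):

# Stop AD DS for the maintenance window (prompts to stop dependent services)
net stop ntds

# On the system volume this schedules a boot-time check; reboot within the window
chkdsk c: /f

# Bring AD DS back once the check is done
net start ntds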

3. Use Read-Only Analysis First

# Check for errors without fixing
chkdsk c: /v /scan

Conclusion

A simple chkdsk c: /f command turned into a complete domain controller rebuild. The combination of high file count, active database operations, and file system repair created a perfect storm of corruption.
