Disclaimer: I do not accept responsibility for any issues arising from scripts being run without adequate understanding. It is the user's responsibility to review and assess any code before execution.

How NTLM Authentication Can Randomly Destroy Half Your Infrastructure: A CIS Level 1 Survival Guide


The CIS Level 1 security baseline is supposed to be the "minimum" security standard (think securing your device with a feather duster) that every organization should have. So why did it cause such a problem for the Citrix environment? And why not for all users? Some were logging in as they should, while random users were getting authentication prompts as their colleagues worked normally.

This revealed a critical issue that exists in most enterprise environments: NTLM authentication can randomly switch between versions, and you won't know until something breaks.

Understanding the Hidden NTLM Behavior

Here's what most administrators don't realize about Windows authentication:

  1. Kerberos fails more than you think - When it does, Windows automatically falls back to NTLM
  2. NTLM version is NOT negotiated - Unlike the Kerberos-to-NTLM fallback (which uses SPNEGO), NTLMv1 vs NTLMv2 is determined by configuration, defaults, and context
  3. The same system can randomly use different versions - This is the killer

When Kerberos fails (missing SPNs, IP-based access, DNS issues), systems fall back to NTLM. But here's the critical part: whether they use NTLMv1 or NTLMv2 can vary randomly based on:

  • Missing Extended Session Security flags
  • Absent or malformed NegotiateFlags
  • Application-specific defaults
  • Legacy compatibility modes
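
To make the "IP-based access" trigger concrete, here is a minimal sketch of a pre-check you could run over the connection targets in your scripts and configs. Test-KerberosCandidate is a hypothetical helper name of my own, and it only covers the one rule I'm sure of: by default Windows has no host name to build an SPN from when the target is an IP literal, so that connection goes to NTLM.

```powershell
# Sketch only: Test-KerberosCandidate is a hypothetical helper, not a built-in cmdlet.
function Test-KerberosCandidate {
    param([Parameter(Mandatory)][string]$Target)

    $parsed = $null
    # An IP literal cannot be mapped to an SPN by default, so Kerberos is skipped.
    if ([System.Net.IPAddress]::TryParse($Target, [ref]$parsed)) {
        return [pscustomobject]@{
            Target           = $Target
            KerberosPossible = $false
            Reason           = 'IP address target'
        }
    }

    # A host name can map to an SPN, but only if that SPN is actually registered.
    [pscustomobject]@{
        Target           = $Target
        KerberosPossible = $true
        Reason           = 'Host name target (SPN must still exist)'
    }
}

# Example targets are made up for illustration.
'10.0.0.5', 'fas01.corp.example' | ForEach-Object { Test-KerberosCandidate -Target $_ }
```

A "KerberosPossible = True" result is not a guarantee. It only means the target is not ruled out by its format, so missing SPNs and DNS problems still need checking separately.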

What Happened with Citrix FAS

Citrix FAS (Federated Authentication Service) servers, despite being modern certificate-based authentication systems, were randomly choosing between NTLMv1 and NTLMv2 because Kerberos was never properly configured. When the CIS L1 baseline set "Network security: LAN Manager authentication level" to Level 5 ("Send NTLMv2 response only. Refuse LM & NTLM"), every NTLMv1 attempt immediately failed. No graceful degradation, no retry with v2 - just failure.

This explained the 50/50 split: authentications randomly using NTLMv2 continued working, while those using NTLMv1 failed.
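
The policy side of this can be sketched in a few lines. The function below models the documented LmCompatibilityLevel values (0 to 5) and nothing else: Get-NtlmOutcome is a hypothetical name of mine, and the model deliberately ignores the application-specific behaviour described above, so treat it as a way to reason about the setting rather than a prediction for any one application.

```powershell
# Sketch only: Get-NtlmOutcome is a hypothetical helper that models policy, not applications.
function Get-NtlmOutcome {
    param(
        [ValidateRange(0, 5)][int]$ClientLevel,  # level on the machine sending the response
        [ValidateRange(0, 5)][int]$ServerLevel   # level on the DC validating the response
    )

    # Levels 0-2 send LM/NTLM (v1-family) responses; levels 3-5 send NTLMv2 only.
    $sent = if ($ClientLevel -le 2) { 'NTLMv1' } else { 'NTLMv2' }

    # Only level 5 refuses NTLMv1. NTLMv2 is accepted at every level.
    $accepted = ($sent -eq 'NTLMv2') -or ($ServerLevel -lt 5)

    [pscustomobject]@{
        Sent     = $sent
        Accepted = $accepted
    }
}

# A sender still on level 2 against a DC hardened to level 5: the logon is refused.
Get-NtlmOutcome -ClientLevel 2 -ServerLevel 5
```

On a real machine the value lives under HKLM:\SYSTEM\CurrentControlSet\Control\Lsa as LmCompatibilityLevel. If it is absent, the operating system default applies, so a missing value does not mean level 0.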

How to Detect This Before It Breaks

After the painful rollback, I developed a comprehensive monitoring strategy. Here's exactly what to look for:

Step 1: Enable NTLM Auditing (Do This First)

Before making any changes, enable auditing through Group Policy:

On Domain Controllers:

  • Computer Configuration → Windows Settings → Security Settings → Local Policies → Security Options
  • Set "Network Security: Restrict NTLM: Audit NTLM authentication in this domain" to "Enable all"

On Domain Policy:

  • "Network Security: Restrict NTLM: Audit Incoming NTLM Traffic" → "Enable auditing for domain accounts"
  • "Network security: Restrict NTLM: Outgoing NTLM traffic to remote servers" → "Audit all"

This creates logs in: Applications and Services Logs → Microsoft → Windows → NTLM → Operational
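
Once auditing has been running for a while, a quick tally of that log tells you where the NTLM traffic sits. This is a rough sketch: Group-NtlmAuditEvent is a hypothetical helper, and the meanings attached to event IDs 8001-8004 are my reading of the audit events, so verify them against the event text in your own log.

```powershell
# Sketch only: Group-NtlmAuditEvent is a hypothetical helper.
# The ID-to-meaning table is an assumption; confirm it against your own events.
function Group-NtlmAuditEvent {
    param([Parameter(Mandatory)][object[]]$Records)

    $labels = @{
        8001 = 'Outgoing NTLM from this machine'
        8002 = 'Incoming NTLM to this machine'
        8003 = 'Incoming NTLM with a domain account'
        8004 = 'NTLM validated by this domain controller'
    }

    # Count records per event ID, busiest first.
    $Records | Group-Object -Property Id | Sort-Object Count -Descending | ForEach-Object {
        $id = [int]$_.Name
        [pscustomobject]@{
            EventId = $id
            Count   = $_.Count
            Meaning = if ($labels.ContainsKey($id)) { $labels[$id] } else { 'Other' }
        }
    }
}

# Usage on a Windows machine with auditing enabled (run elevated):
# Group-NtlmAuditEvent -Records (Get-WinEvent -LogName 'Microsoft-Windows-NTLM/Operational' -MaxEvents 5000)
```

The function takes plain objects with an Id property, so you can feed it exported events as easily as live ones.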

Step 2: Hunt for the Epiphany - Event ID 4624

The critical indicator is Event ID 4624 in the Security log. Look for the "Package Name (NTLM only)" field:

  • "NTLM V1" = Will fail when CIS L1 is applied
  • "NTLM V2" = Will continue working

# Find all NTLMv1 authentication attempts in the last 7 days.
# Run elevated: reading the Security log needs administrative rights, and
# -ErrorAction SilentlyContinue would otherwise hide the access error.
$StartTime = (Get-Date).AddDays(-7)
$NTLMv1Events = @(Get-WinEvent -FilterHashtable @{
    LogName='Security'
    ID=4624
    StartTime=$StartTime
} -ErrorAction SilentlyContinue | Where-Object {$_.Message -like "*NTLM V1*"})

if ($NTLMv1Events.Count -gt 0) {
    Write-Host "WARNING: Found $($NTLMv1Events.Count) NTLMv1 authentications!" -ForegroundColor Red
    $NTLMv1Events | Select-Object TimeCreated, 
        @{Name='Account';Expression={
            # The first "Account Name" belongs to the Subject block (usually "-"
            # for network logons), so read the one under "New Logon" instead
            if ($_.Message -match '(?s)New Logon:.*?Account Name:\s+(\S+)') { $Matches[1] }
        }},
        @{Name='Workstation';Expression={
            if ($_.Message -match 'Workstation Name:\s+(\S+)') { $Matches[1] }
        }},
        MachineName | Format-Table -AutoSize
} else {
    Write-Host "No NTLMv1 authentications found in this machine's Security log" -ForegroundColor Green
}
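
Matching on the Message text works, but it depends on the display language and the layout of the rendered message. A sturdier option is to read the named fields from the event XML. The sketch below assumes the standard 4624 field names (TargetUserName, WorkstationName, IpAddress, LmPackageName); ConvertFrom-LogonEventXml is a hypothetical helper name, and the account and host in the sample are made up.

```powershell
# Sketch only: ConvertFrom-LogonEventXml is a hypothetical helper.
function ConvertFrom-LogonEventXml {
    param([Parameter(Mandatory)][string]$Xml)

    $doc = [xml]$Xml

    # Index every <Data Name="..."> element of the event by its Name attribute.
    $fields = @{}
    foreach ($data in $doc.Event.EventData.Data) {
        $fields[$data.Name] = $data.'#text'
    }

    [pscustomobject]@{
        Account     = $fields['TargetUserName']
        Workstation = $fields['WorkstationName']
        SourceIp    = $fields['IpAddress']
        NtlmVersion = $fields['LmPackageName']   # "NTLM V1", "NTLM V2" or "-"
    }
}

# Made-up sample event, trimmed to the fields used here.
$sampleXml = "<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'><EventData>" +
    "<Data Name='TargetUserName'>jsmith</Data><Data Name='WorkstationName'>FAS01</Data>" +
    "<Data Name='LmPackageName'>NTLM V1</Data><Data Name='IpAddress'>10.0.0.5</Data>" +
    "</EventData></Event>"
ConvertFrom-LogonEventXml -Xml $sampleXml
```

Against a live log, pass each record's ToXml() output, for example: Get-WinEvent ... | ForEach-Object { ConvertFrom-LogonEventXml -Xml $_.ToXml() }.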

Step 3: Check for Kerberos Failures - Event ID 4776

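Event ID 4776 is logged on a domain controller each time it validates credentials over NTLM. It does not prove that Kerberos failed, but every one of these logons is one that did not use Kerberos, so high volumes are a strong hint that Kerberos isn't working properly: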

# Check NTLM fallback frequency
$NTLMFallbacks = Get-WinEvent -FilterHashtable @{
    LogName='Security'
    ID=4776
    StartTime=(Get-Date).AddHours(-24)
} -ErrorAction SilentlyContinue

Write-Host "Found $($NTLMFallbacks.Count) NTLM authentications in last 24 hours"
if ($NTLMFallbacks.Count -gt 100) {
    Write-Host "WARNING: High NTLM usage indicates Kerberos problems!" -ForegroundColor Yellow
}
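
A raw count tells you there is a problem but not where it is. Here is a rough sketch that breaks 4776 events down by source workstation, assuming the usual 4776 field names (Workstation, Status). Get-NtlmValidationSource is a hypothetical helper, and it treats any Status other than 0x0 as a failed validation.

```powershell
# Sketch only: Get-NtlmValidationSource is a hypothetical helper.
function Get-NtlmValidationSource {
    param(
        [Parameter(Mandatory)][string[]]$EventXml,  # one XML string per 4776 event
        [int]$Top = 10
    )

    $rows = foreach ($xml in $EventXml) {
        $data = ([xml]$xml).Event.EventData.Data
        [pscustomobject]@{
            Workstation = ($data | Where-Object { $_.Name -eq 'Workstation' }).'#text'
            # Status 0x0 means the credentials validated; anything else counts as a failure.
            Failed      = ($data | Where-Object { $_.Name -eq 'Status' }).'#text' -ne '0x0'
        }
    }

    # Busiest source workstations first.
    $rows | Group-Object Workstation | Sort-Object Count -Descending | Select-Object -First $Top |
        ForEach-Object {
            [pscustomobject]@{
                Workstation = $_.Name
                Total       = $_.Count
                Failures    = @($_.Group | Where-Object Failed).Count
            }
        }
}

# Usage on a domain controller (run elevated):
# $xml = Get-WinEvent -FilterHashtable @{ LogName='Security'; ID=4776 } -MaxEvents 5000 |
#     ForEach-Object { $_.ToXml() }
# Get-NtlmValidationSource -EventXml $xml -Top 10
```

The machines at the top of that list are the ones to check for missing SPNs or IP-based connections first.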

Step 4: Look for the Random Version Problem

This is crucial - find systems that randomly use both NTLMv1 and NTLMv2:

# Find systems with inconsistent NTLM versions
$LogonEvents = Get-WinEvent -FilterHashtable @{
    LogName='Security'
    ID=4624
    StartTime=(Get-Date).AddDays(-1)
} -ErrorAction SilentlyContinue | Where-Object {$_.Message -like "*NTLM V*"}

$Systems = @{}
# Named $LogonEvent because $Event is a PowerShell automatic variable
foreach ($LogonEvent in $LogonEvents) {
    # Read each capture straight after its own -match: a second -match
    # overwrites $Matches, so both values cannot come from one condition
    if ($LogonEvent.Message -match 'Workstation Name:\s+(\S+)') {
        $Workstation = $Matches[1]
        if ($LogonEvent.Message -match 'Package Name \(NTLM only\):\s*(NTLM V\d)') {
            $Version = $Matches[1]

            if (-not $Systems.ContainsKey($Workstation)) {
                $Systems[$Workstation] = @()
            }
            $Systems[$Workstation] += $Version
        }
    }
}

Write-Host "`nSystems with inconsistent NTLM versions:" -ForegroundColor Cyan
foreach ($System in $Systems.Keys) {
    # @() keeps .Count reliable when only one version was seen
    $Versions = @($Systems[$System] | Select-Object -Unique)
    if ($Versions.Count -gt 1) {
        Write-Host "$System uses: $($Versions -join ', ') - THIS WILL CAUSE FAILURES!" -ForegroundColor Red
    }
}

Step 5: Monitor Performance Counters for Early Warning

The Netlogon performance counters reveal if NTLM is already struggling:

# Monitor NTLM authentication bottlenecks
$counters = @(
    '\Netlogon(*)\Semaphore Waiters',
    '\Netlogon(*)\Semaphore Holders', 
    '\Netlogon(*)\Semaphore Timeouts',
    '\Netlogon(*)\Average Semaphore Hold Time'
)

Write-Host "`nChecking NTLM Performance Health:" -ForegroundColor Cyan
foreach ($counter in $counters) {
    try {
        # -ErrorAction Stop makes a missing counter a terminating error, so the catch block below actually runs
        $result = Get-Counter $counter -ErrorAction Stop
        foreach ($sample in $result.CounterSamples) {
            $value = [math]::Round($sample.CookedValue, 2)
            
            # Evaluate health
            $status = "OK"
            $color = "Green"
            
            if ($sample.Path -like "*Waiters*" -and $value -gt 0) {
                $status = "WARNING - Authentication queue building!"
                $color = "Yellow"
            }
            if ($sample.Path -like "*Timeouts*" -and $value -gt 0) {
                $status = "CRITICAL - Authentications failing!"
                $color = "Red"
            }
            # Average Semaphore Hold Time is reported in seconds, so 5 here means 5 seconds
            if ($sample.Path -like "*Hold Time*" -and $value -gt 5) {
                $status = "WARNING - Slow authentication!"
                $color = "Yellow"
            }
            
            Write-Host "$($sample.Path): $value - $status" -ForegroundColor $color
        }
    } catch {
        Write-Host "Counter $counter not available" -ForegroundColor Gray
    }
}

Key Performance Counter Meanings:

  • Semaphore Waiters > 0: Authentication requests are queuing (bottleneck)
  • Semaphore Holders at max: All authentication threads busy
  • Semaphore Timeouts > 0: Authentications have failed due to timeout (this is a running total, so watch whether it keeps climbing)
  • Average Hold Time > 5 seconds: Authentication is too slow

Step 6: Check System Event Log for NTLM Delays

Netlogon events 5816-5819 in the System log record NTLM authentication requests that failed or ran slowly:

# Check for NTLM delay/failure events
$DelayEvents = Get-WinEvent -FilterHashtable @{
    LogName='System'
    ID=5816,5817,5818,5819
    StartTime=(Get-Date).AddDays(-7)
} -ErrorAction SilentlyContinue

if ($DelayEvents) {
    Write-Host "WARNING: Found $($DelayEvents.Count) NTLM delay/failure events" -ForegroundColor Red
    Write-Host "Event 5816: Authentication failures"
    Write-Host "Event 5818: Authentication delays exceeding threshold"
}

Step 7: Complete Pre-Flight Check

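Run this comprehensive check from an elevated session before applying any NTLM restrictions. Every query in it suppresses errors, so a Security log that cannot be read looks exactly the same as a clean one: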

# Complete NTLM Health Check Script
function Test-NTLMReadiness {
    param([int]$DaysToCheck = 7)
    
    Write-Host "=== NTLM Readiness Assessment ===" -ForegroundColor Cyan
    $Ready = $true
    
    # Check for NTLMv1 usage
    Write-Host "`nChecking for NTLMv1 usage..." -ForegroundColor Yellow
    $NTLMv1Count = (Get-WinEvent -FilterHashtable @{
        LogName='Security'
        ID=4624
        StartTime=(Get-Date).AddDays(-$DaysToCheck)
    } -ErrorAction SilentlyContinue | Where-Object {$_.Message -like "*NTLM V1*"}).Count
    
    if ($NTLMv1Count -gt 0) {
        Write-Host "   FAILED: Found $NTLMv1Count NTLMv1 authentications" -ForegroundColor Red
        $Ready = $false
    } else {
        Write-Host "   PASSED: No NTLMv1 usage detected" -ForegroundColor Green
    }
    
    # Check Kerberos health
    Write-Host "`nChecking Kerberos vs NTLM usage..." -ForegroundColor Yellow
    $KerbEvents = (Get-WinEvent -FilterHashtable @{
        LogName='Security'
        ID=4768,4769
        StartTime=(Get-Date).AddHours(-1)
    } -ErrorAction SilentlyContinue).Count
    
    $NTLMEvents = (Get-WinEvent -FilterHashtable @{
        LogName='Security'
        ID=4776
        StartTime=(Get-Date).AddHours(-1)
    } -ErrorAction SilentlyContinue).Count
    
    if ($NTLMEvents -gt $KerbEvents) {
        Write-Host "   WARNING: More NTLM than Kerberos (NTLM: $NTLMEvents, Kerberos: $KerbEvents)" -ForegroundColor Yellow
    } else {
        Write-Host "   PASSED: Kerberos is primary authentication method" -ForegroundColor Green
    }
    
    # Check performance counters
    Write-Host "`nChecking NTLM performance health..." -ForegroundColor Yellow
    $Waiters = (Get-Counter '\Netlogon(*)\Semaphore Waiters' -ErrorAction SilentlyContinue).CounterSamples.CookedValue | Measure-Object -Maximum
    $Timeouts = (Get-Counter '\Netlogon(*)\Semaphore Timeouts' -ErrorAction SilentlyContinue).CounterSamples.CookedValue | Measure-Object -Sum
    
    if ($Waiters.Maximum -gt 0 -or $Timeouts.Sum -gt 0) {
        Write-Host "   FAILED: NTLM bottlenecks detected" -ForegroundColor Red
        $Ready = $false
    } else {
        Write-Host "   PASSED: No NTLM bottlenecks detected" -ForegroundColor Green
    }
    
    # Final verdict
    Write-Host "`n=== ASSESSMENT COMPLETE ===" -ForegroundColor Cyan
    if ($Ready) {
        Write-Host "RESULT: Environment ready for CIS Level 1 baseline" -ForegroundColor Green
    } else {
        Write-Host "RESULT: DO NOT APPLY CIS Level 1 - Critical issues found" -ForegroundColor Red
        Write-Host "`nRequired fixes:"
        Write-Host "1. Identify all systems using NTLMv1"
        Write-Host "2. Fix Kerberos configuration (SPNs, DNS)"
        Write-Host "3. Review inconsistent NTLM version usage"
        Write-Host "4. Re-run assessment after fixes"
    }
    
    return $Ready
}

# Run the assessment
Test-NTLMReadiness -DaysToCheck 7

What You're Looking For - The Critical Indicators

Before applying CIS L1, you need:

  1. ZERO instances of "NTLM V1" in Event 4624 for at least 7 days
  2. Minimal Event 4776 (NTLM credential validations, each one a logon that did not use Kerberos)
  3. No Events 5816-5819 (NTLM delays/failures)
  4. No Semaphore Waiters or Timeouts in performance counters
  5. No systems showing both "NTLM V1" and "NTLM V2" (the randomness problem)

If you see ANY "NTLM V1" in your logs, those authentications WILL fail when you apply "Refuse LM & NTLM". There's no fallback, no retry - they just stop working.

Key Technical Details

Why NTLM Falls Back:

  • When clients retry without Extended Session Security, they lack NegotiateFlags
  • The server must forward the request to the DC with whatever flags it received
  • The DC makes the decision based on its LmCompatibilityLevel setting
  • If the DC is at Level 4 or below, it may accept NTLMv1

The MaxConcurrentApi Factor:

  • Default is 1 for workstations, 10 for servers/DCs (Windows 2012+)
  • Controls how many concurrent NTLM authentications can process
  • When exceeded, requests queue (Semaphore Waiters)
  • Eventually timeout if queue grows too long
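
If the counters show queuing, the next question is what MaxConcurrentApi should be. The sketch below applies the sizing formula from Microsoft's MaxConcurrentApi tuning guidance as I understand it: (semaphore acquires + semaphore timeouts) × average semaphore hold time ÷ length of the collection window. Get-SuggestedMaxConcurrentApi is a hypothetical helper, the inputs are assumed to be deltas over the same window, and the numbers in the example are invented.

```powershell
# Sketch only: Get-SuggestedMaxConcurrentApi is a hypothetical helper.
# Acquires and timeouts are assumed to be deltas measured over CollectionSeconds.
function Get-SuggestedMaxConcurrentApi {
    param(
        [double]$SemaphoreAcquires,
        [double]$SemaphoreTimeouts,
        [double]$AverageHoldTimeSeconds,
        [double]$CollectionSeconds
    )

    if ($CollectionSeconds -le 0) { throw 'CollectionSeconds must be positive.' }

    # Requests that wanted the semaphore, times how long each held it,
    # spread over the window, gives the number of concurrent slots needed.
    $raw = ($SemaphoreAcquires + $SemaphoreTimeouts) * $AverageHoldTimeSeconds / $CollectionSeconds

    # Round up: a fractional slot still needs a whole thread.
    [int][math]::Ceiling($raw)
}

# Invented sample: 9000 acquires and 1000 timeouts in 10 minutes, 0.5 s average hold time.
Get-SuggestedMaxConcurrentApi -SemaphoreAcquires 9000 -SemaphoreTimeouts 1000 -AverageHoldTimeSeconds 0.5 -CollectionSeconds 600
```

Treat the result as a starting point for testing rather than a value to set blindly, and remember that raising it only hides the underlying Kerberos problem.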

Setting Warning Thresholds:

HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters
WarningEventThreshold (DWORD) = 5000 (5 seconds in milliseconds)

Conclusion

Before applying CIS Level 1 or any NTLM restrictions, leave NTLM auditing enabled for at least 7 days, then run the assessment script. If you see any NTLMv1 usage, inconsistent versions, or performance bottlenecks, fix those first. The "minimum security baseline" will expose every authentication weakness in your environment.

This isn't just about Citrix - it affects Exchange, SQL Server, file shares, web applications, RDP, and any system using Windows Integrated Authentication. The random NTLM version behavior is a time bomb waiting in most environments.

Remember: In Windows authentication, what seems random usually isn't - it's just poorly documented default behavior meeting years of accumulated technical debt.
