The CIS Level 1 security baseline is supposed to be the "minimum" security standard (think securing your device with a feather duster) that every organization should have. So why did it cause such a problem for the Citrix environment? And why not for all users? Some were logging in as they should, while random others got authentication prompts as their colleagues worked normally.
This revealed a critical issue that exists in most enterprise environments: NTLM authentication can randomly switch between versions, and you won't know until something breaks.
Understanding the Hidden NTLM Behavior
Here's what most administrators don't realize about Windows authentication:
- Kerberos fails more than you think - When it does, Windows automatically falls back to NTLM
- NTLM version is NOT negotiated - Unlike the Kerberos-to-NTLM fallback (which uses SPNEGO), NTLMv1 vs NTLMv2 is determined by configuration, defaults, and context
- The same system can randomly use different versions - This is the killer
When Kerberos fails (missing SPNs, IP-based access, DNS issues), systems fall back to NTLM. But here's the critical part: whether they use NTLMv1 or NTLMv2 can vary randomly based on:
- Missing Extended Session Security flags
- Absent or malformed NegotiateFlags
- Application-specific defaults
- Legacy compatibility modes
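The configuration side of this can be inspected directly. As a minimal sketch, the helper below translates an LmCompatibilityLevel value into the wording of the "LAN Manager authentication level" policy, and a second helper reads the value from the local registry. The function names are illustrative, and the fallback to 3 when the value is absent is an assumption about current Windows defaults that is worth verifying for your OS versions.

```powershell
# Policy wording for each LmCompatibilityLevel value
# ("Network security: LAN Manager authentication level").
$LmLevels = @{
    0 = 'Send LM & NTLM responses'
    1 = 'Send LM & NTLM - use NTLMv2 session security if negotiated'
    2 = 'Send NTLM response only'
    3 = 'Send NTLMv2 response only'
    4 = 'Send NTLMv2 response only. Refuse LM'
    5 = 'Send NTLMv2 response only. Refuse LM & NTLM'
}

# Hypothetical helper: map a level number to its policy description.
function Get-LmCompatibilityDescription {
    param([Parameter(Mandatory)][int]$Level)
    if ($LmLevels.ContainsKey($Level)) { $LmLevels[$Level] }
    else { "Unknown level $Level" }
}

# Hypothetical helper (Windows only): read the configured level.
# Assumption: when the value is absent, the effective level is 3.
function Get-LocalLmCompatibilityLevel {
    $key  = 'HKLM:\SYSTEM\CurrentControlSet\Control\Lsa'
    $prop = Get-ItemProperty -Path $key -Name LmCompatibilityLevel -ErrorAction SilentlyContinue
    if ($null -ne $prop) { [int]$prop.LmCompatibilityLevel } else { 3 }
}

# Usage on a Windows host:
# Get-LmCompatibilityDescription -Level (Get-LocalLmCompatibilityLevel)
```

Running this on clients, servers, and DCs separately is useful because the level that matters for acceptance is the one on the machine validating the response, not only the one sending it.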
What Happened with Citrix FAS
Citrix FAS (Federated Authentication Service) servers, despite being modern certificate-based authentication systems, were randomly choosing between NTLMv1 and NTLMv2 because Kerberos was never properly configured. When the CIS L1 baseline set Level 5 ("Refuse LM & NTLM"), every NTLMv1 attempt immediately failed. No graceful degradation, no retry with v2 - just failure.
This explained the 50/50 split: authentications randomly using NTLMv2 continued working, while those using NTLMv1 failed.
How to Detect This Before It Breaks
After the painful rollback, I developed a comprehensive monitoring strategy. Here's exactly what to look for:
Step 1: Enable NTLM Auditing (Do This First)
Before making any changes, enable auditing through Group Policy:
On Domain Controllers:
- Computer Configuration → Windows Settings → Security Settings → Local Policies → Security Options
- Set "Network Security: Restrict NTLM: Audit NTLM authentication in this domain" to "Enable all"
On Domain Policy:
- "Network Security: Restrict NTLM: Audit Incoming NTLM Traffic" → "Enable auditing for domain accounts"
- "Network security: Restrict NTLM: Outgoing NTLM traffic to remote servers" → "Audit all"
This creates logs in: Applications and Services Logs → Microsoft → Windows → NTLM → Operational
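Once auditing has been on for a while, it helps to confirm that events are actually arriving. The sketch below summarizes audit events by event ID with a count and the newest timestamp; Get-NtlmAuditSummary is a hypothetical helper name, and the usage lines assume the log is exposed to Get-WinEvent as Microsoft-Windows-NTLM/Operational.

```powershell
# Hypothetical helper: summarize event records by Id.
# Works on any objects that expose Id and TimeCreated properties.
function Get-NtlmAuditSummary {
    param([Parameter(Mandatory)][object[]]$Records)
    $Records |
        Group-Object -Property Id |
        Sort-Object -Property Count -Descending |
        ForEach-Object {
            [pscustomobject]@{
                EventId = [int]$_.Name
                Count   = $_.Count
                Newest  = ($_.Group | Measure-Object -Property TimeCreated -Maximum).Maximum
            }
        }
}

# Usage on a machine where auditing is enabled (Windows only):
# $records = Get-WinEvent -LogName 'Microsoft-Windows-NTLM/Operational' -MaxEvents 1000
# Get-NtlmAuditSummary -Records $records | Format-Table -AutoSize
```

An empty result after a day of normal use more likely means the policy has not applied yet than that the environment is NTLM-free.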
Step 2: Hunt for the Epiphany - Event ID 4624
The critical indicator is Event ID 4624 in the Security log. Look for the "Package Name (NTLM only)" field:
- "NTLM V1" = Will fail when CIS L1 is applied
- "NTLM V2" = Will continue working
# Find all NTLMv1 authentication attempts in the last 7 days
$StartTime = (Get-Date).AddDays(-7)
$NTLMv1Events = Get-WinEvent -FilterHashtable @{
    LogName   = 'Security'
    ID        = 4624
    StartTime = $StartTime
} -ErrorAction SilentlyContinue | Where-Object { $_.Message -like "*NTLM V1*" }
if ($NTLMv1Events) {
    Write-Host "WARNING: Found $(@($NTLMv1Events).Count) NTLMv1 authentications!" -ForegroundColor Red
    $NTLMv1Events | Select-Object TimeCreated,
        @{Name='Account';Expression={
            # 4624 lists "Account Name" twice; take the one under "New Logon",
            # because the Subject entry is usually "-" for network logons
            if ($_.Message -match '(?s)New Logon:.*?Account Name:\s+(\S+)') { $Matches[1] }
        }},
        @{Name='Workstation';Expression={
            if ($_.Message -match 'Workstation Name:\s+(\S+)') { $Matches[1] }
        }},
        MachineName | Format-Table -AutoSize
} else {
    Write-Host "No NTLMv1 authentications found - Safe to proceed" -ForegroundColor Green
}
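Matching on the rendered message text means every 4624 event is retrieved and formatted before it is filtered, which can be slow on a busy Security log. A possible alternative, sketched below, is to let the event log service filter with an XPath query on the LmPackageName data field, which is the field behind "Package Name (NTLM only)". New-NtlmV1LogonXPath is a hypothetical helper name.

```powershell
# Hypothetical helper: build an XPath filter for 4624 events whose
# LmPackageName is "NTLM V1" within the last $Days days.
function New-NtlmV1LogonXPath {
    param([int]$Days = 7)
    $windowMs = [long]$Days * 86400000   # days -> milliseconds
    "*[System[EventID=4624 and TimeCreated[timediff(@SystemTime) <= $windowMs]]]" +
    " and *[EventData[Data[@Name='LmPackageName']='NTLM V1']]"
}

# Usage (Windows only, needs rights to read the Security log):
# Get-WinEvent -LogName Security -FilterXPath (New-NtlmV1LogonXPath -Days 7) -ErrorAction SilentlyContinue
```

Because this filters on the structured field rather than the localized message, it should also behave the same on non-English systems.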
Step 3: Check for Kerberos Failures - Event ID 4776
Event ID 4776 is logged when a domain controller validates credentials over NTLM, so each one marks an authentication that did not use Kerberos. High volumes indicate Kerberos isn't working properly:
# Check NTLM fallback frequency
$NTLMFallbacks = @(Get-WinEvent -FilterHashtable @{
    LogName   = 'Security'
    ID        = 4776
    StartTime = (Get-Date).AddHours(-24)
} -ErrorAction SilentlyContinue)
Write-Host "Found $($NTLMFallbacks.Count) NTLM authentications in last 24 hours"
if ($NTLMFallbacks.Count -gt 100) {
    Write-Host "WARNING: High NTLM usage indicates Kerberos problems!" -ForegroundColor Yellow
}
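A raw count says that NTLM is in use but not where it comes from. As a rough sketch, the helper below ranks source workstations by how many NTLM validations they generated, which points at the hosts whose Kerberos setup (SPNs, DNS, IP-based access) deserves a look first. Get-TopNtlmSources is a hypothetical name, and the property order used in the usage comment is an assumption about the 4776 event layout.

```powershell
# Hypothetical helper: rank workstations by number of NTLM validations.
# Expects objects with a Workstation property.
function Get-TopNtlmSources {
    param(
        [Parameter(Mandatory)][object[]]$Records,
        [int]$Top = 10
    )
    $Records |
        Group-Object -Property Workstation |
        Sort-Object -Property Count -Descending |
        Select-Object -First $Top -Property @{Name='Workstation';Expression={$_.Name}}, Count
}

# Usage (Windows only, on a DC). Assumption: event 4776 properties are ordered
# PackageName, TargetUserName, Workstation, Status.
# $records = Get-WinEvent -FilterHashtable @{LogName='Security'; ID=4776; StartTime=(Get-Date).AddHours(-24)} |
#     ForEach-Object { [pscustomobject]@{ Account = $_.Properties[1].Value; Workstation = $_.Properties[2].Value } }
# Get-TopNtlmSources -Records $records | Format-Table -AutoSize
```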
Step 4: Look for the Random Version Problem
This is crucial - find systems that randomly use both NTLMv1 and NTLMv2:
# Find systems with inconsistent NTLM versions
$Events = Get-WinEvent -FilterHashtable @{
    LogName   = 'Security'
    ID        = 4624
    StartTime = (Get-Date).AddDays(-1)
} -ErrorAction SilentlyContinue | Where-Object { $_.Message -like "*NTLM V*" }
$Systems = @{}
# $Event is an automatic variable in PowerShell, so the loop uses a different name
foreach ($LogonEvent in $Events) {
    # Each -match overwrites $Matches, so save the first capture before matching again
    if ($LogonEvent.Message -match 'Workstation Name:\s+(\S+)') {
        $Workstation = $Matches[1]
        if ($LogonEvent.Message -match 'Package Name.*:\s*(NTLM V\d)') {
            $Version = $Matches[1]
            if (-not $Systems.ContainsKey($Workstation)) {
                $Systems[$Workstation] = @()
            }
            $Systems[$Workstation] += $Version
        }
    }
}
Write-Host "`nSystems with inconsistent NTLM versions:" -ForegroundColor Cyan
foreach ($System in $Systems.Keys) {
    $Versions = @($Systems[$System] | Select-Object -Unique)
    if ($Versions.Count -gt 1) {
        Write-Host "$System uses: $($Versions -join ', ') - THIS WILL CAUSE FAILURES!" -ForegroundColor Red
    }
}
Step 5: Monitor Performance Counters for Early Warning
The Netlogon performance counters reveal if NTLM is already struggling:
# Monitor NTLM authentication bottlenecks
$counters = @(
    '\Netlogon(*)\Semaphore Waiters',
    '\Netlogon(*)\Semaphore Holders',
    '\Netlogon(*)\Semaphore Timeouts',
    '\Netlogon(*)\Average Semaphore Hold Time'
)
Write-Host "`nChecking NTLM Performance Health:" -ForegroundColor Cyan
foreach ($counter in $counters) {
    try {
        # -ErrorAction Stop so that a missing counter actually reaches the catch block
        $result = Get-Counter $counter -ErrorAction Stop
        foreach ($sample in $result.CounterSamples) {
            $value = [math]::Round($sample.CookedValue, 2)
            # Evaluate health
            $status = "OK"
            $color = "Green"
            if ($sample.Path -like "*Waiters*" -and $value -gt 0) {
                $status = "WARNING - Authentication queue building!"
                $color = "Yellow"
            }
            if ($sample.Path -like "*Timeouts*" -and $value -gt 0) {
                $status = "CRITICAL - Authentications failing!"
                $color = "Red"
            }
            # 5000 assumes this counter reports milliseconds; use 5 if yours reports seconds
            if ($sample.Path -like "*Hold Time*" -and $value -gt 5000) {
                $status = "WARNING - Slow authentication!"
                $color = "Yellow"
            }
            Write-Host "$($sample.Path): $value - $status" -ForegroundColor $color
        }
    } catch {
        Write-Host "Counter $counter not available" -ForegroundColor Gray
    }
}
Key Performance Counter Meanings:
- Semaphore Waiters > 0: Authentication requests are queuing (bottleneck)
- Semaphore Holders at max: All authentication threads busy
- Semaphore Timeouts > 0: Authentications are failing due to timeout
- Average Hold Time > 5 seconds: Authentication is too slow
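A single counter read is only a snapshot, and a queue that builds during the morning logon rush can easily be missed. One way to catch it, sketched below under that assumption, is to collect several samples and report the peak per counter instance. Measure-CounterPeak is a hypothetical helper name; the usage lines rely on the standard -SampleInterval and -MaxSamples parameters of Get-Counter.

```powershell
# Hypothetical helper: peak CookedValue per counter path.
# Expects objects with Path and CookedValue properties.
function Measure-CounterPeak {
    param([Parameter(Mandatory)][object[]]$Samples)
    $Samples | Group-Object -Property Path | ForEach-Object {
        [pscustomobject]@{
            Counter = $_.Name
            Peak    = ($_.Group | Measure-Object -Property CookedValue -Maximum).Maximum
        }
    }
}

# Usage (Windows only): 12 samples, 5 seconds apart, then the peak per instance.
# $paths   = '\Netlogon(*)\Semaphore Waiters', '\Netlogon(*)\Semaphore Timeouts'
# $samples = Get-Counter -Counter $paths -SampleInterval 5 -MaxSamples 12 |
#     ForEach-Object { $_.CounterSamples }
# Measure-CounterPeak -Samples $samples | Format-Table -AutoSize
```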
Step 6: Check System Event Log for NTLM Delays
Events 5816-5819 indicate NTLM authentication problems:
# Check for NTLM delay/failure events
$DelayEvents = @(Get-WinEvent -FilterHashtable @{
    LogName   = 'System'
    ID        = 5816,5817,5818,5819
    StartTime = (Get-Date).AddDays(-7)
} -ErrorAction SilentlyContinue)
if ($DelayEvents.Count -gt 0) {
    Write-Host "WARNING: Found $($DelayEvents.Count) NTLM delay/failure events" -ForegroundColor Red
    Write-Host "Event 5816: Authentication failures"
    Write-Host "Event 5818: Authentication delays exceeding threshold"
}
Step 7: Complete Pre-Flight Check
Run this comprehensive check before applying any NTLM restrictions:
# Complete NTLM Health Check Script
function Test-NTLMReadiness {
    param([int]$DaysToCheck = 7)
    Write-Host "=== NTLM Readiness Assessment ===" -ForegroundColor Cyan
    $Ready = $true
    # Check for NTLMv1 usage
    Write-Host "`nChecking for NTLMv1 usage..." -ForegroundColor Yellow
    $NTLMv1Count = @(Get-WinEvent -FilterHashtable @{
        LogName   = 'Security'
        ID        = 4624
        StartTime = (Get-Date).AddDays(-$DaysToCheck)
    } -ErrorAction SilentlyContinue | Where-Object { $_.Message -like "*NTLM V1*" }).Count
    if ($NTLMv1Count -gt 0) {
        Write-Host " FAILED: Found $NTLMv1Count NTLMv1 authentications" -ForegroundColor Red
        $Ready = $false
    } else {
        Write-Host " PASSED: No NTLMv1 usage detected" -ForegroundColor Green
    }
    # Check Kerberos health
    Write-Host "`nChecking Kerberos vs NTLM usage..." -ForegroundColor Yellow
    $KerbEvents = @(Get-WinEvent -FilterHashtable @{
        LogName   = 'Security'
        ID        = 4768,4769
        StartTime = (Get-Date).AddHours(-1)
    } -ErrorAction SilentlyContinue).Count
    $NTLMEvents = @(Get-WinEvent -FilterHashtable @{
        LogName   = 'Security'
        ID        = 4776
        StartTime = (Get-Date).AddHours(-1)
    } -ErrorAction SilentlyContinue).Count
    if ($NTLMEvents -gt $KerbEvents) {
        Write-Host " WARNING: More NTLM than Kerberos (NTLM: $NTLMEvents, Kerberos: $KerbEvents)" -ForegroundColor Yellow
    } else {
        Write-Host " PASSED: Kerberos is primary authentication method" -ForegroundColor Green
    }
    # Check performance counters
    Write-Host "`nChecking NTLM performance health..." -ForegroundColor Yellow
    $Waiters = (Get-Counter '\Netlogon(*)\Semaphore Waiters' -ErrorAction SilentlyContinue).CounterSamples.CookedValue | Measure-Object -Maximum
    $Timeouts = (Get-Counter '\Netlogon(*)\Semaphore Timeouts' -ErrorAction SilentlyContinue).CounterSamples.CookedValue | Measure-Object -Sum
    if ($Waiters.Maximum -gt 0 -or $Timeouts.Sum -gt 0) {
        Write-Host " FAILED: NTLM bottlenecks detected" -ForegroundColor Red
        $Ready = $false
    } else {
        Write-Host " PASSED: No NTLM bottlenecks detected" -ForegroundColor Green
    }
    # Final verdict
    Write-Host "`n=== ASSESSMENT COMPLETE ===" -ForegroundColor Cyan
    if ($Ready) {
        Write-Host "RESULT: Environment ready for CIS Level 1 baseline" -ForegroundColor Green
    } else {
        Write-Host "RESULT: DO NOT APPLY CIS Level 1 - Critical issues found" -ForegroundColor Red
        Write-Host "`nRequired fixes:"
        Write-Host "1. Identify all systems using NTLMv1"
        Write-Host "2. Fix Kerberos configuration (SPNs, DNS)"
        Write-Host "3. Review inconsistent NTLM version usage"
        Write-Host "4. Re-run assessment after fixes"
    }
    return $Ready
}
# Run the assessment
Test-NTLMReadiness -DaysToCheck 7
What You're Looking For - The Critical Indicators
Before applying CIS L1, you need:
- ZERO instances of "NTLM V1" in Event 4624 for at least 7 days
- Minimal Event 4776 (shows Kerberos failures with NTLM fallback)
- No Events 5816-5819 (NTLM delays/failures)
- No Semaphore Waiters or Timeouts in performance counters
- No systems showing both "NTLM V1" and "NTLM V2" (the randomness problem)
If you see ANY "NTLM V1" in your logs, those authentications WILL fail when you apply "Refuse LM & NTLM". There's no fallback, no retry - they just stop working.
Key Technical Details
Why NTLM Falls Back:
- When clients retry without Extended Session Security, they lack NegotiateFlags
- The server must forward the request to the DC with whatever flags it received
- The DC makes the decision based on its LmCompatibilityLevel setting
- If the DC is at Level 4 or below, it may accept NTLMv1
The MaxConcurrentApi Factor:
- Default is 1 for workstations, 10 for servers/DCs (Windows 2012+)
- Controls how many concurrent NTLM authentications can process
- When exceeded, requests queue (Semaphore Waiters)
- Eventually timeout if queue grows too long
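To see what a given machine is working with, the configured value can be read from the Netlogon parameters key and, when it is absent, resolved against the defaults listed above. The sketch below keeps that decision in a small function so it can be checked without touching the registry. Resolve-MaxConcurrentApi is a hypothetical name, and the defaults it returns are simply the ones stated in this article.

```powershell
# Hypothetical helper: effective MaxConcurrentApi given the registry value
# (or $null when it is not set) and whether the machine is a server/DC.
# Defaults follow the article: 1 for workstations, 10 for servers/DCs (2012+).
function Resolve-MaxConcurrentApi {
    param(
        [Nullable[int]]$ConfiguredValue,
        [bool]$IsServer
    )
    if ($null -ne $ConfiguredValue) { return [int]$ConfiguredValue }
    if ($IsServer) { 10 } else { 1 }
}

# Usage (Windows only):
# $key        = 'HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters'
# $configured = (Get-ItemProperty -Path $key -Name MaxConcurrentApi -ErrorAction SilentlyContinue).MaxConcurrentApi
# $isServer   = (Get-CimInstance Win32_OperatingSystem).ProductType -ne 1   # 1 = workstation
# Resolve-MaxConcurrentApi -ConfiguredValue $configured -IsServer $isServer
```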
Setting Warning Thresholds:
HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters
WarningEventThreshold (DWORD) = 5000 (5 seconds in milliseconds)
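If you prefer to script the setting rather than edit the registry by hand, something like the following should work. It uses the value name and data given above as-is; -WhatIf is included so the command only previews the change, and whether Netlogon needs a restart to pick it up is something to confirm in your environment.

```powershell
# Windows only, run elevated. -WhatIf previews the change; remove it to apply.
$key = 'HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters'
New-ItemProperty -Path $key -Name 'WarningEventThreshold' -PropertyType DWord -Value 5000 -Force -WhatIf
```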
Conclusion
Before applying CIS Level 1 or any NTLM restrictions, collect at least 7 days of audit data and run the assessment script against it. If you see any NTLMv1 usage, inconsistent versions, or performance bottlenecks, fix those first. The "minimum security baseline" will expose every authentication weakness in your environment.
This isn't just about Citrix - it affects Exchange, SQL Server, file shares, web applications, RDP, and any system using Windows Integrated Authentication. The random NTLM version behavior is a time bomb waiting in most environments.
Remember: In Windows authentication, what seems random usually isn't - it's just poorly documented default behavior meeting years of accumulated technical debt.