Silent Treatment: When remote daemon starts with radio silence...

This post is a follow-up to an earlier post.

I recently upgraded the monitoring script for our cache monitor daemon so it can start the daemon remotely across multiple Mac minis. The existing script could detect running daemons just fine, but when services were actually stopped, it couldn't restart them remotely because of sudo authentication issues.

What was the problem?

The PowerShell monitoring script worked great for detection, but when daemons were genuinely stopped across all servers, manual intervention was required on each machine. We needed the script to handle:

Different daemon implementations (Python vs shell scripts)
Various sudo configurations across servers
Remote authentication without hanging
Clean logging regardless of start method

The cause of the problem(s)

Two issues were causing the confusion:

Issue #1: Filename Mismatch

The monitoring script was searching for cache_monitor.py, but one server was running cache_monitor.sh (shell script vs Python script). Simple oversight, but it completely broke detection.

Issue #2: Legacy PID Reference

The script had somehow picked up a hardcoded PID in the search pattern:

$CheckProcessCommand = "ps aux | grep -E 'cache_monitor\.py|1005' | grep -v grep"

This was likely left over from debugging or testing. The pattern matches either the script name or the text 1005, and grep does not restrict that match to the PID column. On one server, there happened to be a system daemon (_nsurlsessiond) running with PID 1005, so the script incorrectly identified that as our cache monitor!

How was the problem resolved?

I ran into a couple of issues, so let's go through them now. One was of my own making: a leftover from testing that I had overlooked.

Flexible Process Detection

# Before: Too specific and brittle
$CheckProcessCommand = "ps aux | grep -E 'cache_monitor\.py|1005' | grep -v grep"

# After: Handles both file types, no hardcoded PIDs
$CheckProcessCommand = "ps aux | grep -E 'cache_monitor\.(py|sh)' | grep -v grep"
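To see why the hardcoded PID caused a false positive, the two patterns can be compared against captured ps output. The following is a minimal shell sketch: the sample process lines and the function names sample_ps, match_old and match_new are made up for illustration and are not part of the original script.

```shell
#!/bin/sh
# Sample `ps aux`-style lines (invented for illustration).
sample_ps() {
    cat <<'EOF'
_nsurlsessiond    1005   0.0  0.1  4300000  9000   ??  Ss   9:00AM   0:01.20 /usr/libexec/nsurlsessiond
root              2310   0.0  0.2  4400000 15000   ??  S    9:05AM   0:03.10 /bin/sh ./cache_monitor.sh start
EOF
}

# Old pattern: matches the .py script name OR the text 1005 anywhere in a line.
match_old() { grep -E 'cache_monitor\.py|1005'; }

# New pattern: matches only the script name, with either extension.
match_new() { grep -E 'cache_monitor\.(py|sh)'; }

echo "old pattern matched:"
sample_ps | match_old
echo "new pattern matched:"
sample_ps | match_new
```

With this sample, the old pattern picks up the unrelated daemon and misses the shell-script variant, while the new pattern picks up only the cache monitor line.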

Smart Sudo Handling

When the script tried to start the daemons, it hit sudo password prompts that caused it to hang. I added a check that detects whether passwordless sudo is available before choosing how to start the daemon:

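# Test if passwordless sudo works first (-n makes sudo fail instead of prompting)
$TestSudoCommand = "sudo -n true 2>/dev/null && echo 'PASSWORDLESS_OK' || echo 'PASSWORD_REQUIRED'"

# Run the test on the remote host, the same way the process check is run
$SudoTestOutput = & $PuttyPath -batch -l $Username -pw $Password $RemoteHost $TestSudoCommand

# Use appropriate method based on test result
if ($SudoTestOutput -match "PASSWORDLESS_OK") {
    $StartCommand = "sudo ./cache_monitor.py start"
} else {
    # -S makes sudo read the password from stdin instead of the terminal
    $StartCommand = "echo '$Password' | sudo -S ./cache_monitor.py start"
}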
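The same check can also be wrapped in a small shell function on the remote side. This is only a sketch under assumptions: the function name sudo_mode and the SUDO override variable are hypothetical, and the override exists purely so the logic can be tried with a stand-in command instead of real sudo. The -n flag is sudo's non-interactive mode, which fails rather than prompting.

```shell
#!/bin/sh
# Print PASSWORDLESS_OK if sudo can run without prompting, else PASSWORD_REQUIRED.
# SUDO is a hypothetical override used only for trying the function out.
sudo_mode() {
    if "${SUDO:-sudo}" -n true 2>/dev/null; then
        echo 'PASSWORDLESS_OK'
    else
        echo 'PASSWORD_REQUIRED'
    fi
}

sudo_mode
```

Using if/else rather than a chained `&& ... ||` keeps the two outcomes separate, so the fallback branch only runs when the sudo test itself fails.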

Timeout Protection with Smart Verification

The trickiest part was handling start commands that would hang but actually succeed. This was resolved by adding PowerShell job-based timeouts with immediate verification:

$Job = Start-Job -ScriptBlock { <# start command goes here #> }
$JobResult = Wait-Job -Job $Job -Timeout 15

# Wait-Job returns nothing if the job is still running when the timeout expires
if (!$JobResult) {
    Remove-Job -Job $Job -Force
    Write-Host "Command timed out, checking if service started anyway..."

    # Immediate verification check
    $ImmediateOutput = & $PuttyPath -batch -l $Username -pw $Password $RemoteHost $CheckProcessCommand

    if ($ImmediateOutput -match "cache_monitor\.(py|sh)") {
        # Service actually started successfully despite timeout
        return "Started Successfully"  # Clean status for CSV logging
    }
}

The Timeout Success Pattern

Many times the sudo start command would hang (never return), but the daemon would actually start successfully. The key insight was checking the service state immediately after timeout rather than assuming failure. For logging purposes, we simplified the status to just "Started Successfully" rather than flagging it as a timeout-related issue, since the end result was identical to a normal start.
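The same timeout-then-verify idea can be sketched in plain shell. This is a rough illustration, not the script described above: run_with_timeout and start_and_verify are hypothetical helper names, the "Start Failed" status is invented for the example, and the demo at the end uses a marker file and sleep as stand-ins for a real daemon start and process check.

```shell
#!/bin/sh
# Run a command with a time limit in seconds; return 124 if it had to be killed.
# Polls a background job so it does not depend on GNU `timeout`.
# The command's output is discarded, since only its exit status is used here.
run_with_timeout() {
    limit=$1; shift
    "$@" >/dev/null 2>&1 &
    pid=$!
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            kill "$pid" 2>/dev/null
            wait "$pid" 2>/dev/null
            return 124
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$pid"
}

# Start a service; on timeout, check the real state before declaring failure.
# Arguments: time limit, start command string, check command string.
start_and_verify() {
    sv_limit=$1; sv_start=$2; sv_check=$3
    run_with_timeout "$sv_limit" sh -c "$sv_start"
    sv_status=$?
    if [ "$sv_status" -eq 0 ]; then
        echo "Started Successfully"
    elif [ "$sv_status" -eq 124 ] && sh -c "$sv_check" >/dev/null 2>&1; then
        # Timed out, but the service is up: same end result as a normal start.
        echo "Started Successfully"
    else
        echo "Start Failed"   # hypothetical status, invented for this sketch
        return 1
    fi
}

# Demo with stand-ins: the "start" creates a marker file, then hangs.
marker="${TMPDIR:-/tmp}/cache_monitor_demo.$$"
start_and_verify 1 "touch '$marker'; exec sleep 30" "test -e '$marker'"
rm -f "$marker"
```

In the demo, the start command never returns within the limit, so it is killed, but the check still finds the marker and the result is reported as a normal start.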

Diagnostic Commands

If you're troubleshooting similar issues:

# Check what's actually running
ps aux | grep your_daemon_name

# Test sudo configuration  
sudo -n true && echo "Passwordless OK" || echo "Password required"

# Find processes by partial name (safer than PID)
pgrep -f daemon_name

Remember: monitoring scripts should adapt to your environment, not the other way around.
