Schrödinger's Server Alert: Both Read and Unread Until Someone Checks

Mac Mini caching servers, critical infrastructure that significantly reduces internet bandwidth consumption across the business, experienced multiple failures. Despite a comprehensive web dashboard displaying the health status of all servers, the issues went unnoticed for hours, or even days.

The problem? No one was actively monitoring the dashboard when the failures occurred. It was only when I happened to check and pointed out the issues that action was taken. This delay could have had serious consequences, as these caching servers play a vital role in managing our bandwidth efficiently. Without them, our internet usage spikes dramatically, potentially affecting business operations across multiple sites.

I also suspect that nobody noticed the degraded caching servers because the Internet bandwidth did not look saturated: the "primary" dense-site caching server was still online.

The Email Alert "Notification"

This experience led to the development of an automated email notification system. But let's be clear about what this actually is: a backup system for our failed primary system, which will likely fail in similar ways.

This is what that looked like. Let's start with the technical email, where you can see London is Unhealthy:

Then we have the support team notification email, which advises that there is a problem, that it will be fixed, and that the dashboard should be checked for updates:

Then, when the problem is resolved, running the PowerShell script with the -Resolved switch sends a notification that everything is back to normal:

1. Persistent Notifications

During core business hours, the system sends email alerts every hour until the problem is resolved. Why hourly? Because sending them more frequently guarantees they'll be filtered as spam (mentally or literally), and sending them less frequently means even longer periods of system failure.
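
The hourly rule is simple enough to sketch. The real script is PowerShell; the Python below only illustrates the decision logic. The 09:00 to 17:00 weekday window and all function and constant names are assumptions, since "core business hours" is not defined more precisely here.

```python
from datetime import datetime, timedelta

# Assumed values: the article only says "core business hours" and "every hour".
BUSINESS_START_HOUR = 9   # hypothetical start of the working day
BUSINESS_END_HOUR = 17    # hypothetical end of the working day
ALERT_INTERVAL = timedelta(hours=1)


def in_business_hours(now):
    """Return True on weekdays between the assumed start and end hours."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR


def should_send_alert(now, last_sent, unhealthy_sites):
    """Decide whether another reminder email is due.

    now             -- current datetime
    last_sent       -- datetime of the previous alert, or None if none was sent
    unhealthy_sites -- list of site names currently reported as unhealthy
    """
    if not unhealthy_sites:
        return False  # nothing is broken, so stay quiet
    if not in_business_hours(now):
        return False  # nobody is expected to be reading email
    if last_sent is None:
        return True   # first alert for this incident
    return now - last_sent >= ALERT_INTERVAL
```

A scheduled task could call should_send_alert on every run and only send mail when it returns True, which keeps the repeat interval independent of how often the check itself runs.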

Of course, these hourly emails will likely join the hundreds of other unread emails in most inboxes. But maybe - just maybe - someone will notice the growing count of near-identical subject lines. To keep them from blurring together completely, I needed to vary the content.

2. Role-Based Information Distribution

The system sends two distinct types of emails:

Technical Team Emails

These contain comprehensive details including:

Specific sites experiencing issues
Exact nature of the problems
Health status indicators
Timestamp information

This gives technicians immediate, actionable intelligence about what needs to be fixed and where to focus their efforts.

Support Team Emails

These are intentionally simplified, containing:

A notification that issues have been detected
Confirmation that technical staff are investigating
Clear messaging that no action is required from support staff
A reminder to monitor the dashboard for updates
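
The split between the two audiences could look roughly like this. It is an illustrative Python sketch, not the PowerShell script itself; the function names, the message wording, and the shape of the input (site, status, detail) are assumptions.

```python
from datetime import datetime


def build_technical_email(issues, checked_at):
    """Build the detailed body for technicians.

    issues     -- list of (site, status, detail) tuples; this shape is assumed
    checked_at -- datetime of the dashboard check
    """
    lines = [f"Caching server check at {checked_at:%Y-%m-%d %H:%M}", ""]
    for site, status, detail in issues:
        lines.append(f"{site}: {status} ({detail})")
    return "\n".join(lines)


def build_support_email():
    """Build the simplified body for support staff: no sites, no diagnostics."""
    return "\n".join([
        "An issue has been detected with one or more caching servers.",
        "Technical staff are investigating.",
        "No action is required from support staff.",
        "Please monitor the dashboard for updates.",
    ])
```

Keeping the support body free of any site names or error text is deliberate: there is nothing in it to paste into a search engine.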

3. Clear Communication Hierarchy

By separating technical and support notifications, the system prevents a common problem in incident response: well-meaning but unqualified staff attempting to help. When support teams receive vague error notifications, they may turn to AI chatbots or search engines for solutions. These tools, while helpful in many contexts, can confidently provide incorrect guidance that leads to configuration errors and potentially makes the situation worse.

The Importance of Resolution Notifications

One critical aspect often overlooked in alert systems is confirming when issues are resolved. Technical staff know when they've fixed a problem, but support teams are left wondering if the situation has been addressed.

Our solution includes a -Resolved parameter that, when activated, sends a clear "all clear" notification to support teams. This simple addition:

Reduces unnecessary follow-up inquiries
Provides closure to the incident
Maintains confidence in the monitoring system
Documents the resolution time for future analysis
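
As a rough illustration of how such a switch changes the outgoing message, here is a Python sketch using a --resolved flag as a stand-in for the PowerShell -Resolved parameter. The subject lines, body text, and function names are assumptions, not the actual script's wording.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --resolved mirrors the script's -Resolved switch."""
    parser = argparse.ArgumentParser(description="Caching server alert sketch")
    parser.add_argument(
        "--resolved",
        action="store_true",
        help="send the all-clear notification instead of a problem alert",
    )
    return parser.parse_args(argv)


def support_message(resolved):
    """Return a (subject, body) pair for the support team."""
    if resolved:
        return (
            "Caching servers: issue resolved",
            "The earlier caching server issue has been fixed. "
            "All servers are back to normal.",
        )
    return (
        "Caching servers: issue detected",
        "Technical staff are investigating. No action is required.",
    )
```

Calling support_message(parse_args().resolved) would then pick the right message for a run; the time at which the all-clear is sent doubles as the recorded resolution time.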

Technical Implementation

The PowerShell script monitors an HTML dashboard, searching for specific health status indicators. It identifies servers marked as "unhealthy" while ignoring those marked as "healthy" or "healthy out of hours" (for servers that may be offline during maintenance windows).

The script's intelligence lies in its parsing logic, which extracts:

Site names from the dashboard HTML
Associated health status for each site
Only sites requiring immediate attention

This targeted approach ensures that alerts are meaningful and actionable, avoiding the "alert fatigue" that plagues many monitoring systems.
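
To make the parsing step concrete, here is a minimal Python sketch of the same idea. The dashboard's real HTML is not shown in this post, so the table layout (one row per site, site name in the first cell, status in the second), the sample markup, and the site names other than London are all assumptions.

```python
from html.parser import HTMLParser

# Hypothetical dashboard markup; the real page structure may differ.
SAMPLE_HTML = """
<table>
<tr><th>Site</th><th>Status</th></tr>
<tr><td>London</td><td>Unhealthy</td></tr>
<tr><td>Paris</td><td>Healthy</td></tr>
<tr><td>Berlin</td><td>Healthy out of hours</td></tr>
</table>
"""


class DashboardParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain no <td> cells and are skipped
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data


def find_unhealthy_sites(html):
    """Return {site: status} for sites whose status is exactly 'unhealthy'."""
    parser = DashboardParser()
    parser.feed(html)
    parser.close()
    unhealthy = {}
    for row in parser.rows:
        if len(row) < 2:
            continue
        site, status = row[0].strip(), row[1].strip()
        # Exact comparison: a substring test for "healthy" would also match
        # "unhealthy", so "healthy" and "healthy out of hours" fall through.
        if status.lower() == "unhealthy":
            unhealthy[site] = status
    return unhealthy
```

With the sample markup above, only London is returned; the healthy and out-of-hours sites never reach the email.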

By ensuring rapid detection and response to caching server failures, we protect not just IT infrastructure but the broader business operations that depend on reliable network performance.

Conclusion

Let's drop the pretense: email alerts aren't a magic solution to the dashboard monitoring problem. They're simply another channel that can be ignored. Unread email counts in the thousands are common, and "alert fatigue" is real regardless of the delivery mechanism.

However, by implementing email alerts alongside visual dashboards, we're essentially rolling two dice instead of one. The probability that someone will notice an issue increases marginally - maybe someone checks their email during a boring meeting, or maybe the repetitive alerts finally annoy someone enough to take action.

