Email taking 60 minutes to be delivered to certain addresses is not a normal situation. It indicated that something, somewhere, was causing messages to queue. Luckily, this particular bottleneck was far enough downstream that it wasn't a disaster for all email, only for certain messages; that was the first clue.
After a brief investigation I found myself in the middle of what can only be described as a local domain email storm. What started as routine user complaints about "slow email" quickly escalated into an investigation of a mail storm that peaked at 250,000 messages in a single hour.
This is the complete technical breakdown of how a single misconfigured test server created a cascading failure across our entire hybrid Exchange infrastructure.
The Architecture: A Complex Hybrid Mail Flow
Our mail infrastructure is a carefully orchestrated hybrid setup designed to balance cloud security with on-premises control. Understanding this architecture is crucial to comprehending how the failure propagated:
External Mail Flow Path
Internet Email → Exchange Edge Protection → Exchange Online → Hybrid Connector → Exchange On-Premises → hMailServer → SAP CRM → Notes
Component Deep Dive
- Exchange Edge Protection: Our first line of defense, handling spam filtering and malware detection before messages reach Exchange Online.
- Exchange Online (Office 365): Provides cloud-based email processing and additional security layers. Messages destined for our on-premises systems are routed through the hybrid connector.
- Hybrid Connector: The critical bridge between cloud and on-premises infrastructure. This connector handles authentication, encryption, and routing decisions for messages flowing between environments.
- Exchange On-Premises: Our on-premises Exchange 2019 server handles internal routing and applies transport rules before forwarding to specialized systems.
- hMailServer: An open-source SMTP server we use as a routing hub for legacy systems. It handles the connection to SAP CRM via a dedicated send connector with specific authentication requirements.
- SAP CRM Integration: Business-critical customer relationship management system that requires emails to be processed through specific connectors with custom headers and formatting.
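With this many hops, it helps to be able to enumerate the connectors at each boundary from PowerShell before touching anything. A minimal sketch using standard Exchange cmdlets (the output columns are real properties; which connectors you see depends entirely on your own environment):
# Exchange Online side (Exchange Online PowerShell): connectors routing mail on-premises
Get-OutboundConnector | Format-Table Name, ConnectorType, SmartHosts
# On-premises side (Exchange Management Shell): send connectors toward hMailServer
Get-SendConnector | Format-Table Name, AddressSpaces, SmartHosts, Port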
The Investigation: Following the Digital Breadcrumbs
Phase 1: Initial Symptoms
Users reported email delays that surfaced on a Wednesday, but the logs later revealed the problem had been brewing for longer than that:
- Delivery delays of 60-65 minutes for external emails, but only to specific inboxes
- Exchange Online emails processing normally
- Exchange On-Premises processing normally
- No obvious errors/alarms in Exchange Online admin center
- Exchange Online Transport queues showing normal message counts
- Exchange On-Premises Transport queues showing normal message counts
Phase 2: Systematic Flow Analysis
I began tracing a test message through each hop:
- Internet → Exchange Online: message received and processed in <5 seconds
- Exchange Online → Hybrid: connector showing normal latency (2-3 seconds)
- Hybrid → On-Premises Exchange: normal processing time
- Exchange On-Premises → hMailServer: queue building up significantly
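Those hop timings come from message tracking on each side of the hybrid boundary. A sketch of the kind of queries involved, with a hypothetical test sender and subject (Get-MessageTrace runs in Exchange Online PowerShell, Get-MessageTrackingLog in the on-premises Exchange Management Shell):
# Exchange Online: confirm receipt and hand-off timing for the test message
Get-MessageTrace -SenderAddress "flowtest@example.com" -StartDate (Get-Date).AddHours(-1) -EndDate (Get-Date)
# On-premises: follow the same message toward the hMailServer send connector
Get-MessageTrackingLog -MessageSubject "Flow test" -Start (Get-Date).AddHours(-1) | Format-Table Timestamp, EventId, Source, Recipients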
Phase 3: The hMailServer Investigation
This is where things got interesting. hMailServer was clearly the bottleneck, but why?
Log File Analysis
# Normal daily log size
-rw-r--r-- 1 hmailserver hmailserver  23M Dec 13 23:59 hmailserver_20241213.log
# Problem days
-rw-r--r-- 1 hmailserver hmailserver 956M Dec 14 23:59 hmailserver_20241214.log
-rw-r--r-- 1 hmailserver hmailserver 1.2G Dec 15 23:59 hmailserver_20241215.log
-rw-r--r-- 1 hmailserver hmailserver 1.1G Dec 16 23:59 hmailserver_20241216.log
-rw-r--r-- 1 hmailserver hmailserver 1.5G Dec 17 23:59 hmailserver_20241217.log
-rw-r--r-- 1 hmailserver hmailserver 1.0G Dec 18 09:15 hmailserver_20241218.log
A 40x increase in daily log size over five days, with the latest file still growing.
Connection Limit Analysis
hMailServer configuration showed:
- Maximum SMTP connections: 150
- Maximum delivery threads: 50
- Current active connections: 149/150 (consistently maxed out)
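The active-connection figure is easy to verify from outside hMailServer's own admin UI. A quick sanity check against the SMTP listener (this assumes hMailServer is listening on port 25):
# Count established connections on the SMTP port
@(Get-NetTCPConnection -LocalPort 25 -State Established -ErrorAction SilentlyContinue).Count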
Queue Analysis
SMTP Queue Statistics:
- Messages entering the queue: 41,667 per 10-minute interval
- Average queue time: 55-62 minutes
- Failed delivery attempts: 15,000+ per hour
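The buildup was also visible from the Exchange side of that hop. Get-Queue on the on-premises server shows per-next-hop queue depth; a sketch of watching it (the 100-message threshold is just an illustrative filter):
# Watch queues building on Exchange On-Premises toward downstream hops
Get-Queue | Where-Object { $_.MessageCount -gt 100 } | Format-Table Identity, NextHopDomain, Status, MessageCount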
The Root Cause: A Perfect Storm of Misconfiguration
Deep diving into the logs revealed a pattern of messages from bg44.testing@bear.local. This address belonged to an internal test server that had been configured with a non-existent email address.
Message Content Analysis
To examine the actual message content, I needed to intercept the emails. First, I checked the latest log entries to see the pattern:
Get-Content "C:\Program Files\hMailServer\Logs\hmailserver_20241218.log" -Tail 20
This revealed the horrifying scope of the problem:
2024-12-18 14:23:17 "SMTPD" 4832 10 "220.152.45.67" "SENT: 250 Message queued for delivery" 2024-12-18 14:23:17 "SMTPD" 4832 10 "220.152.45.67" "RECEIVED: From:
bg44.testing@bear.local " 2024-12-18 14:23:17 "SMTPD" 4832 10 "220.152.45.67" "RECEIVED: To:Alert.Monitor@bear.local " 2024-12-18 14:23:18 "SMTPD" 4833 11 "220.152.45.67" "RECEIVED: Subject: Undeliverable: Causing Mail storms is a bad idea" 2024-12-18 14:23:18 "SMTPD" 4833 11 "220.152.45.67" "SENT: 250 Message queued for delivery" 2024-12-18 14:23:18 "SMTPD" 4834 12 "220.152.45.67" "RECEIVED: From:bg44.testing@bear.local " 2024-12-18 14:23:18 "SMTPD" 4834 12 "220.152.45.67" "RECEIVED: To:Alert.Monitor@bear.local " 2024-12-18 14:23:19 "SMTPD" 4835 13 "220.152.45.67" "RECEIVED: Subject: Undeliverable: Causing Mail storms is a bad idea" 2024-12-18 14:23:19 "SMTPD" 4835 13 "220.152.45.67" "SENT: 250 Message queued for delivery" 2024-12-18 14:23:19 "SMTPD" 4836 14 "220.152.45.67" "RECEIVED: From:bg44.testing@bear.local " 2024-12-18 14:23:19 "SMTPD" 4836 14 "220.152.45.67" "RECEIVED: To:Alert.Monitor@bear.local " 2024-12-18 14:23:20 "SMTPD" 4837 15 "220.152.45.67" "RECEIVED: Subject: Undeliverable: Causing Mail storms is a bad idea" 2024-12-18 14:23:20 "SMTPD" 4837 15 "220.152.45.67" "SENT: 250 Message queued for delivery" 2024-12-18 14:23:20 "SMTPD" 4838 16 "220.152.45.67" "RECEIVED: From:bg44.testing@bear.local " 2024-12-18 14:23:20 "SMTPD" 4838 16 "220.152.45.67" "RECEIVED: To:Alert.Monitor@bear.local " 2024-12-18 14:23:21 "SMTPD" 4839 17 "220.152.45.67" "RECEIVED: Subject: Undeliverable: Causing Mail storms is a bad idea" 2024-12-18 14:23:21 "SMTPD" 4839 17 "220.152.45.67" "SENT: 250 Message queued for delivery" 2024-12-18 14:23:21 "SMTPD" 4840 18 "220.152.45.67" "RECEIVED: From:bg44.testing@bear.local " 2024-12-18 14:23:21 "SMTPD" 4840 18 "220.152.45.67" "RECEIVED: To:Alert.Monitor@bear.local " 2024-12-18 14:23:22 "SMTPD" 4841 19 "220.152.45.67" "RECEIVED: Subject: Undeliverable: Causing Mail storms is a bad idea"
The pattern was unmistakable - the same sender, same subject, rapid-fire delivery every few seconds. To get more context, I pulled a larger sample:
Get-Content "C:\Program Files\hMailServer\Logs\hmailserver_20241218.log" -Tail 200 | Select-String "bg44.testing@bear.local" | Measure-Object
This returned 187 matches out of 200 log entries - meaning 93.5% of recent activity was from this single problematic address!
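To see how far back the flood went, the same match can be run over the whole file and bucketed by hour. A sketch, relying on the fixed position of the hour in each log line's timestamp ("2024-12-18 14:23:17 ..." puts the hour at character 11):
# Count matching entries per hour across the whole log
Select-String -Path "C:\Program Files\hMailServer\Logs\hmailserver_20241218.log" -Pattern "bg44.testing@bear.local" -SimpleMatch |
    Group-Object { $_.Line.Substring(11, 2) } |
    Sort-Object Name |
    Format-Table @{Name="Hour";Expression={$_.Name}}, Count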
Then I configured hMailServer to queue messages from this sender for manual review instead of delivering them immediately:
Domain Settings → bg44.testing@bear.local → Queue messages for manual review
The EML File Revelation
Extracting the EML file revealed the complete picture:
Return-Path: <bg44.testing@bear.local>
Subject: Undeliverable: Causing Mail storms is a bad idea
Auto-Submitted: auto-replied
X-MS-PublicTrafficType: Email
X-MS-Exchange-Organization-AuthSource: exchange.bear.local
X-MS-Exchange-Organization-AuthAs: Internal
X-MS-Exchange-Organization-AuthMechanism: 04
X-MS-Exchange-Organization-SCL: -1
X-MS-Exchange-Organization-PCL: -1
X-Auto-Response-Suppress: All
Date: Wed, 18 Dec 2024 14:23:17 +0000
From: <bg44.testing@bear.local>
To: Alert.Monitor@bear.local
Message-ID: <20241218142317.123456@bear.local>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
This is an automatically generated Delivery Status Notification.
The following message could not be delivered:
[Original message details...]
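For quick triage of captured messages like this one, the loop-relevant headers can be pulled straight out of the .eml file (the file name below is illustrative):
# Show only the headers that matter for loop detection
Get-Content ".\captured-message.eml" -TotalCount 50 |
    Where-Object { $_ -match '^(From|To|Subject|Auto-Submitted|X-Auto-Response-Suppress):' }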
The Feedback Loop Architecture
The auto-response loop worked like this:
- Test Server sends email from bg44.testing@bear.local to Alert.Monitor@bear.local
- Exchange On-Premises attempts delivery but fails (non-existent sender address)
- Exchange NDR Generation creates a Non-Delivery Report (NDR) back to bg44.testing@bear.local
- Test Server Auto-Responder receives the NDR and automatically generates a response
- Response Message gets sent back into the system to Alert.Monitor@bear.local
- Loop Continues indefinitely, with each iteration creating more messages
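What keeps this loop alive is that the test server's auto-responder answers everything, including NDRs. A responder that honors RFC 3834 breaks the cycle by refusing to reply to automated mail. A minimal sketch of that check (a hypothetical helper, not the test server's actual code):
# Returns $true only when a message is safe to auto-reply to (RFC 3834 rules)
function Test-SafeToAutoReply {
    param([hashtable]$Headers)
    # Never reply to anything that declares itself automated
    if ($Headers['Auto-Submitted'] -and $Headers['Auto-Submitted'] -ne 'no') { return $false }
    # Honor suppression requests (DR, NDR, OOF, AutoReply, ...)
    if ($Headers['X-Auto-Response-Suppress']) { return $false }
    # Never reply to bounce senders
    if ($Headers['From'] -match '(?i)(mailer-daemon|postmaster)@') { return $false }
    return $true
}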
The Numbers: Quantifying the Problem
Peak Performance Metrics
At the height of the storm, I documented these metrics:
Peak Message Rate: 41,667 messages per 10 minutes
Per-Minute Rate: 4,166.7 messages/minute
Per-Second Rate: 69.4 messages/second
Daily Projection: 6,000,000 messages/day (vs. normal 15,000/day)
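All of these figures derive from the single measured 10-minute rate; the conversions are easy to sanity-check in the shell:
# Derive the per-minute, per-second, and daily figures from the 10-minute sample
$per10Min = 41667
'{0:N1} messages/minute' -f ($per10Min / 10)
'{0:N1} messages/second' -f ($per10Min / 600)
'{0:N0} messages/day'    -f ($per10Min * 6 * 24)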
Connection Saturation Analysis
With hMailServer's 150 connection limit and 50 delivery threads:
- Theoretical Maximum: 150 concurrent connections
- Observed Utilization: 149/150 (99.3% utilization)
- Thread Pool Exhaustion: 50/50 delivery threads active
- Queue Growth Rate: 4,100+ messages/minute intake vs. 2,500 messages/minute processing
Time to Clear Calculations
Total Backlog: 250,000 messages
Processing Rate: 4,166.7 messages/minute
Time to Clear: 250,000 ÷ 4,166.7 = 60.0 minutes
This calculated clearance time lined up with the 60-65 minute delays users had been reporting.
Stopping the Mail Storm: The Transport Rule
Created a transport rule on Exchange On-Premises to stop the loop:
New-TransportRule -Name "Auto-Response Loop Breaker" `
-SentTo @("Alert.Monitor@bar.local") `
-HeaderMatchesMessageHeader "Auto-Submitted" `
-HeaderMatchesPatterns @("auto-generated", "auto-replied") `
-SetAuditSeverity "Low" `
-DeleteMessage $true `
-RejectMessageReasonText "Auto-response loop detected and terminated"
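After creating the rule it is worth confirming that it is enabled and actually firing. A sketch using standard Exchange cmdlets (AGENTINFO tracking events record transport rule processing; the result size is arbitrary):
# Confirm the rule is enabled
Get-TransportRule "Auto-Response Loop Breaker" | Format-Table Name, State, Priority
# Check that messages to the loop target are being intercepted
Get-MessageTrackingLog -EventId "AGENTINFO" -Recipients "Alert.Monitor@bear.local" -ResultSize 25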
Performance Impact Analysis
Before the Fix
- Average Email Delay: 55-62 minutes
- System Resource Usage: 95% CPU, 89% Memory
- Connection Pool: 99.3% utilization
After the Fix
- Average Email Delay: <4 seconds
- System Resource Usage: 8% CPU, 29% Memory
- Connection Pool: 8% utilization
Lessons Learned: Technical Best Practices
If you are worried about this happening to you, you can implement broader loop detection for all mailboxes with transport rules like the ones below. Two caveats: conditions within a single transport rule are ANDed, so the OR between the two header checks requires two separate rules; and rules this broad will also suppress legitimate NDRs and out-of-office replies, so scope them carefully:
# Advanced loop detection, rule 1: Auto-Submitted header
New-TransportRule -Name "Loop Detection - Auto-Submitted" `
-HeaderMatchesMessageHeader "Auto-Submitted" `
-HeaderMatchesPatterns @("auto-generated", "auto-replied", "auto-notified") `
-SetAuditSeverity "Low" `
-DeleteMessage $true
# Advanced loop detection, rule 2: X-Auto-Response-Suppress header (acts as the OR)
New-TransportRule -Name "Loop Detection - Response Suppression" `
-HeaderMatchesMessageHeader "X-Auto-Response-Suppress" `
-HeaderMatchesPatterns @("DR", "NDR", "RN", "NRN", "OOF", "AutoReply") `
-SetAuditSeverity "Low" `
-DeleteMessage $true
Conclusion: The Beauty of Systematic Problem Solving
This incident showcased both the fragility and resilience of complex mail systems. A single misconfigured test server created a cascade failure that affected thousands of users, but systematic investigation and mathematical analysis led to a precise solution.
The key takeaways:
- Log analysis is your best friend - unusual patterns in log files are often the first indicator of systemic issues (see the canary sketch after this list)
- Mathematical modeling works - understanding your system's capacity limits allows for accurate prediction of resolution times
- Circuit breakers are essential - transport rules can serve as effective circuit breakers for runaway processes
- Monitoring must be multi-layered - single points of failure in monitoring can mask critical issues
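A tiny example of what that extra monitoring layer can look like: a scheduled check that alerts when the current hMailServer log grows far beyond its normal daily size. This is a sketch; the path, threshold, and addresses are illustrative, and in practice the alert should go to a channel that doesn't depend on the mail system being monitored:
# Daily-log canary: a log that is normally ~23MB should never be near 100MB
$logPath = 'C:\Program Files\hMailServer\Logs\hmailserver_{0:yyyyMMdd}.log' -f (Get-Date)
$log = Get-Item $logPath -ErrorAction SilentlyContinue
if ($log -and $log.Length -gt 100MB) {
    Send-MailMessage -SmtpServer 'smtp.bear.local' -From 'canary@bear.local' `
        -To 'ops@bear.local' -Subject 'hMailServer log size anomaly' `
        -Body ('{0} is {1:N0} bytes' -f $log.Name, $log.Length)
}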
The most satisfying aspect of this resolution was the mathematical precision: the calculated 60-minute clearance time matched the reported user delay time exactly.