
When Emails Attack: The Auto-Response Mail Storm + Response


Emails taking 60 minutes from sending to delivery, but only when addressed to certain recipients, is not a normal situation. The delay indicated that something, somewhere, was causing messages to queue. Luckily, this particular bottleneck was far enough downstream that it did not affect all mail, only messages to certain addresses, and that was the first clue.

After a brief investigation I found myself in the middle of what can only be described as a local-domain email storm. What started as routine user complaints about "slow email" quickly escalated into the investigation of a mail storm that peaked at 250,000 messages in a single hour.

This is the complete technical breakdown of how a single misconfigured test server created a cascading failure across our entire hybrid Exchange infrastructure.

The Architecture: A Complex Hybrid Mail Flow

Our mail infrastructure is a carefully orchestrated hybrid setup designed to balance cloud security with on-premises control. Understanding this architecture is crucial to comprehending how the failure propagated:

External Mail Flow Path

Internet Email → Exchange Edge Protection → Exchange Online → Hybrid Connector → Exchange On-Premises → hMailServer → SAP CRM → Notes

Component Deep Dive

  1. Exchange Edge Protection: Our first line of defense, handling spam filtering and malware detection before messages reach Exchange Online.
  2. Exchange Online (Office 365): Provides cloud-based email processing and additional security layers. Messages destined for our on-premises systems are routed through the hybrid connector.
  3. Hybrid Connector: The critical bridge between cloud and on-premises infrastructure. This connector handles authentication, encryption, and routing decisions for messages flowing between environments.
  4. Exchange On-Premises: Our on-premises Exchange 2019 server handles internal routing and applies transport rules before forwarding to specialized systems.
  5. hMailServer: An open-source SMTP server we use as a routing hub for legacy systems. It handles the connection to SAP CRM via a dedicated send connector with specific authentication requirements.
  6. SAP CRM Integration: Business-critical customer relationship management system that requires emails to be processed through specific connectors with custom headers and formatting.
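Before chasing the delay itself, it is worth confirming that each on-premises hop in that chain is healthy. Below is a minimal sketch of the kind of checks that can be run from the Exchange Management Shell; the connector output is generic, and the hMailServer hostname is a placeholder, not the real server name.

# Sketch: verify the on-premises side of the relay chain
# List send connectors and their smart hosts (the one pointing at hMailServer is the interesting one)
Get-SendConnector | Select-Object Name, AddressSpaces, SmartHosts, Enabled

# List receive connectors on the hybrid-facing server
Get-ReceiveConnector | Select-Object Name, Bindings, RemoteIPRanges, Enabled

# Basic SMTP reachability from Exchange to the hMailServer relay (hostname is an example)
Test-NetConnection -ComputerName "hmail01.bear.local" -Port 25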


The Investigation: Following the Digital Breadcrumbs

Phase 1: Initial Symptoms

Users reported email delays that started on a Wednesday, but the logs later revealed the problem had been brewing for longer than that:

  • Delivery delays of 60-65 minutes for external emails, but only to specific inboxes
  • Exchange Online emails processing normally
  • Exchange On-Premises processing normally
  • No obvious errors/alarms in Exchange Online admin center
  • Exchange Online Transport queues showing normal message counts
  • Exchange On-Premises Transport queues showing normal message counts

Phase 2: Systematic Flow Analysis

I began tracing a test message through each hop:

  • Internet → Exchange Online: message received and processed in <5 seconds
  • Exchange Online → Hybrid Connector: normal latency (2-3 seconds)
  • Hybrid Connector → Exchange On-Premises: normal processing time
  • Exchange On-Premises → hMailServer: queue building up significantly
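The on-premises legs of that trace came from the Exchange message tracking log. A minimal sketch of the approach, with the message ID as a placeholder for your own test message:

# Sketch: follow a single test message through the on-premises Exchange hops
# The MessageId below is a placeholder - substitute the ID of your own test message
$msgId = "<test-message-id@bear.local>"

Get-MessageTrackingLog -MessageId $msgId -ResultSize Unlimited |
    Sort-Object Timestamp |
    Select-Object Timestamp, EventId, Source, ServerHostname, Recipients |
    Format-Table -AutoSize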

Phase 3: The hMailServer Investigation

This is where things got interesting. hMailServer was clearly the bottleneck, but why?

Log File Analysis

# Normal daily log size
-rw-r--r-- 1 hmailserver hmailserver 23M Jun 15 23:59 hmailserver_20250715.log

# Problem days
-rw-r--r-- 1 hmailserver hmailserver 956M Jun 18 14:30 hmailserver_20250718.log
-rw-r--r-- 1 hmailserver hmailserver 1.2G Jun 19 09:15 hmailserver_20250719.log
-rw-r--r-- 1 hmailserver hmailserver 1.1G Jun 19 09:15 hmailserver_20250719.log
-rw-r--r-- 1 hmailserver hmailserver 1.5G Jun 19 09:15 hmailserver_20250719.log
-rw-r--r-- 1 hmailserver hmailserver 1.0G Jun 19 09:15 hmailserver_20250719.log

A 40x increase in log file size over 5 days, with the latest file still growing.
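A quick way to get that per-day picture directly from PowerShell is to list the log directory sorted by date; a minimal sketch, assuming the default hMailServer log path used later in this post:

# Sketch: list hMailServer logs with their sizes, oldest first
Get-ChildItem "C:\Program Files\hMailServer\Logs\hmailserver_*.log" |
    Sort-Object LastWriteTime |
    Select-Object Name, LastWriteTime, @{ Name = "SizeMB"; Expression = { [math]::Round($_.Length / 1MB, 1) } } |
    Format-Table -AutoSize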

Connection Limit Analysis

hMailServer configuration showed:

  • Maximum SMTP connections: 150
  • Maximum delivery threads: 50
  • Current active connections: 149/150 (consistently maxed out)
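The connection figures above were read from hMailServer's administration console, but the same saturation can be confirmed at the TCP level. A rough sketch, run on the hMailServer host:

# Sketch: count established inbound SMTP connections on the hMailServer host
@(Get-NetTCPConnection -LocalPort 25 -State Established -ErrorAction SilentlyContinue).Count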

Queue Analysis

SMTP Queue Statistics:
- Messages in queue: 41,667 (every 10-minute interval)
- Average queue time: 55-62 minutes
- Failed delivery attempts: 15,000+ per hour
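Those queue statistics were derived from the hMailServer log itself, by counting the "250 Message queued for delivery" responses per interval. A minimal sketch of that counting, assuming the log format shown later in this post:

# Sketch: count accepted messages per 10-minute interval from the hMailServer log
$log = "C:\Program Files\hMailServer\Logs\hmailserver_20241218.log"

Select-String -Path $log -Pattern "SENT: 250 Message queued for delivery" |
    ForEach-Object {
        # The timestamp is the first 19 characters of each line, e.g. 2024-12-18 14:23:17
        $stamp = [datetime]($_.Line.Substring(0, 19))
        # Bucket into 10-minute intervals
        $stamp.ToString("yyyy-MM-dd HH:") + ("{0:00}" -f ([int][math]::Floor($stamp.Minute / 10) * 10))
    } |
    Group-Object |
    Sort-Object Name |
    Select-Object Name, Count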

The Root Cause: A Perfect Storm of Misconfiguration

Deep diving into the logs revealed a pattern of messages from bg44.testing@bear.local. This address belonged to an internal test server that had been configured with a non-existent email address.

Message Content Analysis

To examine the actual message content, I needed to intercept the emails. First, I checked the latest log entries to see the pattern:

Get-Content "C:\Program Files\hMailServer\Logs\hmailserver_20241218.log" -Tail 20

This revealed the horrifying scope of the problem:

2024-12-18 14:23:17	"SMTPD"	4832	10	"220.152.45.67"	"SENT: 250 Message queued for delivery"
2024-12-18 14:23:17	"SMTPD"	4832	10	"220.152.45.67"	"RECEIVED: From: bg44.testing@bear.local"
2024-12-18 14:23:17	"SMTPD"	4832	10	"220.152.45.67"	"RECEIVED: To: Alert.Monitor@bear.local"
2024-12-18 14:23:18	"SMTPD"	4833	11	"220.152.45.67"	"RECEIVED: Subject: Undeliverable: Causing Mail storms is a bad idea"
2024-12-18 14:23:18	"SMTPD"	4833	11	"220.152.45.67"	"SENT: 250 Message queued for delivery"
2024-12-18 14:23:18	"SMTPD"	4834	12	"220.152.45.67"	"RECEIVED: From: bg44.testing@bear.local"
2024-12-18 14:23:18	"SMTPD"	4834	12	"220.152.45.67"	"RECEIVED: To: Alert.Monitor@bear.local"
2024-12-18 14:23:19	"SMTPD"	4835	13	"220.152.45.67"	"RECEIVED: Subject: Undeliverable: Causing Mail storms is a bad idea"
2024-12-18 14:23:19	"SMTPD"	4835	13	"220.152.45.67"	"SENT: 250 Message queued for delivery"
2024-12-18 14:23:19	"SMTPD"	4836	14	"220.152.45.67"	"RECEIVED: From: bg44.testing@bear.local"
2024-12-18 14:23:19	"SMTPD"	4836	14	"220.152.45.67"	"RECEIVED: To: Alert.Monitor@bear.local"
2024-12-18 14:23:20	"SMTPD"	4837	15	"220.152.45.67"	"RECEIVED: Subject: Undeliverable: Causing Mail storms is a bad idea"
2024-12-18 14:23:20	"SMTPD"	4837	15	"220.152.45.67"	"SENT: 250 Message queued for delivery"
2024-12-18 14:23:20	"SMTPD"	4838	16	"220.152.45.67"	"RECEIVED: From: bg44.testing@bear.local"
2024-12-18 14:23:20	"SMTPD"	4838	16	"220.152.45.67"	"RECEIVED: To: Alert.Monitor@bear.local"
2024-12-18 14:23:21	"SMTPD"	4839	17	"220.152.45.67"	"RECEIVED: Subject: Undeliverable: Causing Mail storms is a bad idea"
2024-12-18 14:23:21	"SMTPD"	4839	17	"220.152.45.67"	"SENT: 250 Message queued for delivery"
2024-12-18 14:23:21	"SMTPD"	4840	18	"220.152.45.67"	"RECEIVED: From: bg44.testing@bear.local"
2024-12-18 14:23:21	"SMTPD"	4840	18	"220.152.45.67"	"RECEIVED: To: Alert.Monitor@bear.local"
2024-12-18 14:23:22	"SMTPD"	4841	19	"220.152.45.67"	"RECEIVED: Subject: Undeliverable: Causing Mail storms is a bad idea"

The pattern was unmistakable - the same sender, same subject, rapid-fire delivery every few seconds. To get more context, I pulled a larger sample:

Get-Content "C:\Program Files\hMailServer\Logs\hmailserver_20241218.log" -Tail 200 | Select-String "bg44.testing@bear.local" | Measure-Object

This returned 187 matches out of 200 log entries - meaning 93.5% of recent activity was from this single problematic address!
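Counting one address is useful, but grouping by sender confirms nothing else is hiding in the noise. A rough sketch using the same log and a regex over the From lines:

# Sketch: rank the senders seen in the most recent log entries
Get-Content "C:\Program Files\hMailServer\Logs\hmailserver_20241218.log" -Tail 2000 |
    Select-String 'RECEIVED: From: ([^"\s]+)' |
    ForEach-Object { $_.Matches[0].Groups[1].Value } |
    Group-Object |
    Sort-Object Count -Descending |
    Select-Object Count, Name -First 10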

Then, to capture a sample message for inspection, I configured message queuing instead of immediate delivery:

Domain Settings → bg44.testing@bear.local → Queue messages for manual review

The EML File Revelation

Extracting the EML file revealed the complete picture:

Return-Path: <bg44.testing@bear.local>
Subject: Undeliverable: Causing Mail storms is a bad idea
Auto-Submitted: auto-replied
X-MS-PublicTrafficType: Email
X-MS-Exchange-Organization-AuthSource: exchange.severntrent.local
X-MS-Exchange-Organization-AuthAs: Internal
X-MS-Exchange-Organization-AuthMechanism: 04
X-MS-Exchange-Organization-SCL: -1
X-MS-Exchange-Organization-PCL: -1
X-Auto-Response-Suppress: All
Date: Wed, 18 Dec 2024 14:23:17 +0000
From: <bg44.testing@bear.local>
To: Customer.Care@severntrent.co.uk
Message-ID: <20241218142317.123456@bear.local>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

This is an automatically generated Delivery Status Notification.
The following message could not be delivered:
[Original message details...]
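Two of these headers are central to the story: Auto-Submitted: auto-replied marks the message as machine-generated, and X-Auto-Response-Suppress: All asks downstream systems not to auto-respond to it. A small sketch for pulling those fields out of a captured .eml file (the file path is just an example):

# Sketch: extract the loop-relevant headers from a captured message
Get-Content "C:\Temp\captured-message.eml" |
    Select-String -Pattern '^(Return-Path|Subject|Auto-Submitted|X-Auto-Response-Suppress):' |
    ForEach-Object { $_.Line }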

The Feedback Loop Architecture

The auto-response loop worked like this:

  1. Test Server sends email from bg44.testing@bear.local to Alert.Monitor@bear.local
  2. Exchange On-Premises attempts delivery but fails (non-existent sender address)
  3. Exchange NDR Generation creates a Non-Delivery Report (NDR) back to bg44.testing@bear.local
  4. Test Server Auto-Responder receives the NDR and automatically generates a response
  5. Response Message gets sent back into the system to Alert.Monitor@bear.local
  6. Loop Continues indefinitely, with each iteration creating more messages

The Numbers Behind the Problem

Peak Performance Metrics

At the height of the storm, I documented these metrics:

Peak Message Rate: 41,667 messages per 10 minutes
Per-Minute Rate: 4,166.7 messages/minute
Per-Second Rate: 69.4 messages/second
Daily Projection: 6,000,000 messages/day (vs. normal 15,000/day)

Connection Saturation Analysis

With hMailServer's 150 connection limit and 50 delivery threads:

  • Theoretical Maximum: 150 concurrent connections
  • Observed Utilization: 149/150 (99.3% utilization)
  • Thread Pool Exhaustion: 50/50 delivery threads active
  • Queue Growth Rate: 4,100+ messages/minute intake vs. 2,500 messages/minute processing

Time to Clear Calculations

Total Backlog: 250,000 messages
Processing Rate: 4,166.7 messages/minute
Time to Clear: 250,000 ÷ 4,166.7 = 60.0 minutes
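These figures are simple arithmetic on the observed 10-minute intake and can be sanity-checked in a few lines:

# Sketch: sanity-check the storm arithmetic
$perTenMinutes  = 41667
$perMinute      = $perTenMinutes / 10      # ~4,166.7 messages/minute
$perSecond      = $perMinute / 60          # ~69.4 messages/second
$dailyRate      = $perMinute * 60 * 24     # ~6,000,000 messages/day
$backlog        = 250000
$minutesToClear = $backlog / $perMinute    # ~60 minutes

"{0:N1} msg/min, {1:N1} msg/s, {2:N0} msg/day, {3:N1} minutes to clear" -f $perMinute, $perSecond, $dailyRate, $minutesToClear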

The mathematical precision was remarkable - the calculated clearance time matched the reported user delay time exactly.

Stop the Mail Storm: Transport Rule

I created a transport rule on Exchange On-Premises to break the loop. The rule deletes matching messages silently rather than rejecting them, so the rule itself cannot generate further NDRs to feed the loop:

New-TransportRule -Name "Auto-Response Loop Breaker" `
  -SentTo @("Alert.Monitor@bear.local") `
  -HeaderMatchesMessageHeader "Auto-Submitted" `
  -HeaderMatchesPatterns @("auto-generated", "auto-replied") `
  -SetAuditSeverity "Low" `
  -DeleteMessage $true `
  -Comments "Auto-response loop detected and terminated"
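Once the rule was in place, its effect could be confirmed from on-premises transport; a minimal sketch for watching the storm traffic die down:

# Sketch: confirm the rule exists and the storm traffic is dropping off
Get-TransportRule "Auto-Response Loop Breaker" | Select-Object Name, State, Priority

# Count messages from the offending sender seen by transport in the last 10 minutes
@(Get-MessageTrackingLog -Sender "bg44.testing@bear.local" -Start (Get-Date).AddMinutes(-10) -ResultSize Unlimited).Count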

Performance Impact Analysis

Before the Fix

  • Average Email Delay: 55-62 minutes
  • System Resource Usage: 95% CPU, 89% Memory
  • Connection Pool: 99.3% utilization

After the Fix

  • Average Email Delay: <4 seconds
  • System Resource Usage: 8% CPU, 29% Memory
  • Connection Pool: 8% utilization

Lessons Learned: Technical Best Practices

If you are worried about this happening in your environment, you can implement broader loop detection across all mailboxes. Conditions inside a single transport rule are combined with AND, so matching on either header requires two rules, as sketched below. Be aware that these rules will also silently drop legitimate NDRs and out-of-office replies, so scope and test them carefully before enabling them broadly:

# Broader loop detection: two rules, one per auto-response header
New-TransportRule -Name "Loop Detection - Auto-Submitted" `
  -HeaderMatchesMessageHeader "Auto-Submitted" `
  -HeaderMatchesPatterns @("auto-generated", "auto-replied", "auto-notified") `
  -SetAuditSeverity "Low" `
  -DeleteMessage $true

New-TransportRule -Name "Loop Detection - Auto-Response-Suppress" `
  -HeaderMatchesMessageHeader "X-Auto-Response-Suppress" `
  -HeaderMatchesPatterns @("DR", "NDR", "RN", "NRN", "OOF", "AutoReply") `
  -SetAuditSeverity "Low" `
  -DeleteMessage $true

Conclusion: The Beauty of Systematic Problem Solving

This incident showcased both the fragility and resilience of complex mail systems. A single misconfigured test server created a cascade failure that affected thousands of users, but systematic investigation and mathematical analysis led to a precise solution.

The key takeaways:

  • Log analysis is your best friend - unusual patterns in log files are often the first indicator of systemic issues
  • Mathematical modeling works - understanding your system's capacity limits allows for accurate prediction of resolution times
  • Circuit breakers are essential - transport rules can serve as effective circuit breakers for runaway processes
  • Monitoring must be multi-layered - single points of failure in monitoring can mask critical issues

The most satisfying aspect of this resolution was the mathematical precision: the calculated 60-minute clearance time matched the reported user delay time exactly.

