ExpressRoute : The Mystery of the "excessive traffic"

If you are using Azure and you want a dedicated link to the Azure backbone then you will be using ExpressRoute. However, if the circuit gets busy you can have issues talking to and from Azure, depending on the load on that connection - that is where this article starts.

The Raw Data

You are assigned a dedicated bandwidth value, and then you have a "burst" value which should not be sustained for long periods of time. Take this as normal load: notice the high axis value on this graph is 400Mb/s. This is the weekend, so that is the low baseline:

Drop into the working week and the baseline is up to 1Gb/s, as you can see below, so that should be your baseline for normal:

That would mean this is very abnormal for a week: notice that the high graph value is 3.5Gb/s. When you are this high, communications between Azure and your local DC usually become very unresponsive. The amber line is the "warning" line and the red line is the "performance issues" line.

This shows the point at which the problem was resolved. This is the next working day: it started on the same path as the day before, but after the issue was resolved the trend went back to normal (ish):

The Investigation

This is where it gets fun. If we take a look at ExpressRoute, we are looking for a curve like this, one that increases sharply. Many people would link that to "users logging in". Well, yes, but what during the login process was causing the issue? We now know the curve we are looking for, so out comes the investigation cap.

This is the blueprint of the issue, so the first port of call is "top talkers", which lists the devices talking the most to certain endpoints. If you work for a large company then there is a high chance you will have a VPN solution that allows your laptops to connect to your corporate network, so that is usually a good place to start, and indeed that is where this started.

VPN Logging and Investigation

First we need a normal day of traffic. This tells you that 05:30 is when connections start, and they continue for most of the day after that:

However, that chart is not that helpful on its own; we need to know what is using the data. In this example the top process name across all laptops, at 331GB for the day, is a client connectivity application that provides a remote desktop service.

However, if we look at the same application on the day of the issue, you can see that the same process across all laptops is now using 934GB of data:

This means on the problem day that the traffic via this remote desktop solution is 603GB higher than it should be across all laptops for that day and indeed until the problem is fixed.
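The arithmetic behind that figure can be sketched in a few lines. This is only an illustration using the two daily totals quoted above; the function name is made up for this example.

```python
def excess_usage(baseline_gb: float, problem_gb: float) -> tuple[float, float]:
    """Return (extra GB, percentage increase) of a problem day over a normal day."""
    extra_gb = problem_gb - baseline_gb
    percent_increase = extra_gb / baseline_gb * 100
    return extra_gb, percent_increase


# Daily totals for the remote desktop process, taken from the article.
extra, pct = excess_usage(baseline_gb=331, problem_gb=934)
print(f"Extra data: {extra:.0f}GB ({pct:.0f}% above the normal day)")
```

Expressing the gap as a percentage as well as in GB makes it easier to compare against other processes that have a very different normal baseline.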

So, let's drill into this top process. That tells me all the devices that have that process running, by device name and VPN group name, and as an added bonus the amount of data each device has transferred through that process:

The destination for these devices was the remote desktop server farm. As you can see below, all of these are RDP host servers, and in some cases they have been doing "outside normal parameters" data usage:
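If your VPN tooling can export raw flow records rather than a ready-made chart, the same "top talkers" drill-down can be approximated as below. This is a rough sketch, not the VPN product's own method: the record layout, device names, group names and byte counts are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical flow records: (device, vpn_group, process, bytes_transferred).
# Every value here is invented; a real export will have its own columns.
SAMPLE_FLOWS = [
    ("laptop-a", "vpn-group-1", "rdp-client", 4_000_000_000),
    ("laptop-b", "vpn-group-1", "rdp-client", 1_500_000_000),
    ("laptop-a", "vpn-group-1", "rdp-client", 2_000_000_000),
    ("laptop-c", "vpn-group-2", "browser", 9_000_000_000),
]


def top_talkers(flows, process, limit=10):
    """Sum bytes per (device, vpn_group) for one process, largest first."""
    totals = defaultdict(int)
    for device, group, proc, nbytes in flows:
        if proc == process:
            totals[(device, group)] += nbytes
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


for (device, group), nbytes in top_talkers(SAMPLE_FLOWS, "rdp-client"):
    print(f"{device} ({group}): {nbytes / 1e9:.1f}GB")
```

Filtering on the process first keeps unrelated heavy hitters, such as a large browser download, out of the ranking.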

Remote Desktop Connections

So we now know the top devices and the top destinations they are talking to. Add this up and you have people with a device, in this case a laptop, using the RDP client, which would look like this:

In this example it's a little more complicated than this, as there is a remote desktop gateway server that manages all the servers behind it. Clients connect to the remote desktop gateway, which is at "beardp-gw.bear.local", and that in turn assigns them the relevant server based on the load balancer.

Finding the high data usage

The problem here is that if a laptop is being used as a thin client, the processing should be done on the remote server, not the local laptop. So where is all this data coming from? For that we need to know the user, which is also outlined below with data usage:

This now tells us the following information from the three tables above:

User:  bear.1
Device : bearclaws.bear.local
Endpoint : bear-rdp-host1.bear.local

Looking at the user's processes

So we have a user using their laptop as a thin client, and that RDP session is using excessive amounts of data. As we know the host, let's look at what is running for that user on it, which should tell us what is causing the high data usage. These are the user's running processes:

That means we have the following open:

Edge - Insights for VPN
Firefox - Blank Tab open
Teams - Chats Open
Chrome - Intranet Open

Remote Desktop should not equal high data?

Right, so with only this open, why have we used 32GB of data? That makes no sense: these are normal applications that should not be using large amounts of data. What we are looking for is something that talks back to the server, with moving or animated content on it, as that could cause the traffic.

Intranet Application Insights

The Intranet is open, the one called Bear-LocalNet, and as it happens we have analytics on that Intranet, so let's take a look at them. Let's start with server requests, which shows when it is used; the "flat line" is obviously outside business hours:

However, if we switch to sessions, which tells you how many "people" are on the website, then this is the chart we get, and this is where it gets interesting:

Intranet v ExpressRoute : Graphs and similar data

We were looking for an upward curve from earlier that was crippling ExpressRoute. If you remember, this was the curve we saw:

Well, this curve looks very similar, right? We are looking for the same type of curve, so could it be the Intranet that is causing this excessive usage? Has something changed on the Intranet?

We are currently looking at a single user, but remember that the Intranet is the default homepage for your company. That means a small change to the Intranet could have a massive knock-on effect when all your users are accessing it every time their browser fires up.
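Eyeballing two charts is how this was spotted, but if you can export both series on the same time buckets you can put a number on "looks very similar". The sketch below computes a plain Pearson correlation; it is not something the original investigation used, and the sample values are placeholders invented purely to show the call.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a flat series has no defined correlation")
    return covariance / sqrt(spread_x * spread_y)


# Placeholder hourly values, NOT real measurements from this incident.
intranet_sessions = [10, 50, 200, 400, 420]
circuit_gbps = [0.4, 0.8, 2.0, 3.3, 3.5]
print(f"correlation: {pearson(intranet_sessions, circuit_gbps):.2f}")
```

A value close to 1 only says the two curves rise and fall together. Anything tied to the working day will correlate with logins, so treat it as a hint about where to look next, not as proof of the cause.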

RDP using more data by over 300% (across all devices)

If you are using RDP then anything that changes on the screen or is redrawn requires data to be sent to the device. If you have a website like Google then all is well: there is nothing moving on it, so you will see very little data transferred from the RDP session to the laptop. But if the background, for example, was moving like a lava lamp, that would cause an excessive amount of data as the screen is being redrawn all the time.

If we take a normal week and compare it to the week when the issue was occurring, across all the laptops, you can see we have a 381% increase in Chrome traffic, which is where the Intranet is being used.

Screen Refresh 101

Back in the days of NT4/Server 2000, when something on the desktop changed during a remote desktop connection, the whole screen was refreshed. That means if you had a clock on your taskbar, you would get a screen update at least once a minute, which is also the reason why you could hide the clock.

Technology has obviously moved forward. Now only items that change trigger a refresh. In the clock example, when the time increments by one minute, the minute section of the clock triggers a refresh, but that does not refresh the whole screen. Likewise, on a website that is static once it has loaded, no refresh will occur except when things change.

Therefore, if you take that static website and add a stock ticker that updates every five minutes, then the ticker area will trigger a refresh every five minutes.

The more screen content that changes, the more data this will generate; this is how the remote desktop refresh works. If you happen to be using Citrix, they have a technology called SpeedScreen Latency Reduction, which is generally reserved for slower connections to avoid interruptions in the session.
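To get a feel for why the size of the changing area matters so much, here is a back-of-the-envelope sketch. It calculates the raw, uncompressed cost of redrawing a region at a given rate. Real remote desktop protocols compress and cache heavily, so actual traffic is far lower than these figures; the region sizes, update rates and 4 bytes per pixel are assumptions chosen for illustration, and only the ratio between the two results is meaningful.

```python
def raw_update_mbps(width_px, height_px, updates_per_second, bytes_per_pixel=4):
    """Uncompressed megabits per second needed to redraw a region at a given rate."""
    bytes_per_second = width_px * height_px * bytes_per_pixel * updates_per_second
    return bytes_per_second * 8 / 1_000_000


# A small taskbar clock area redrawn once a minute (assumed 60x20 pixels).
clock = raw_update_mbps(60, 20, updates_per_second=1 / 60)

# A full HD screen redrawn 30 times a second, e.g. an animated background.
animated = raw_update_mbps(1920, 1080, updates_per_second=30)

print(f"clock: {clock:.5f}Mbps, animated full screen: {animated:.0f}Mbps (raw)")
```

Even allowing for compression, the gap between a small region that changes occasionally and a whole screen that changes constantly is several orders of magnitude.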

For a real-world example of this, get the Sysinternals utility called Procmon. Its job is to list everything that's accessing the registry or a file on the operating system in real time, and it looks like this:

If you run this locally on a laptop or other device, the application will perform absolutely fine. However, if you leave the filter disabled and run it over any kind of remote desktop connection, you will notice it has trouble rendering the processes in real time. When you attempt to view the process information the window seems to go white and doesn't respond. The problem is that it's trying to update that window too many times and it can't keep up with the updates - that's the white box that will appear on the screen once it has "filled the window".

You will notice that the amount of data captured and the buffer space increase and work normally (the image below) while there's a limited amount of information to refresh. When you overload the buffer, you get a non-responsive window inside the application. This is not a non-responsive process; it's just a section of the window that's trying to update too quickly.

If you wish to see an example of this, visit this website here - this is the good old toaster screensaver from the After Dark era. Because the toasters and the toast are flying across the screen, the movement would usually require the whole window to be redrawn, even though it is mainly black.

In a more extreme example, let's put searchlights in the background and have a waving hand in the foreground using CSS3. This can be seen here and a demo is below:

In this example the whole background is moving randomly and there is a waving hand in the text, meaning that most of this website will need to be rendered all the time. This will cause data usage to get rather large for an individual user, so multiply this by all the users who have it as their homepage and that will replicate your issue.

Live Demo with website 

If you look at Task Manager before this demo is run, you can see we have 7% CPU and the top process is not Chrome; it's Task Manager, as you would expect, since that is causing all the updates to the interface.

Then, with the networking view added, you will see that the network card is as idle as it can be for an active RDP session:

Now let's fire up that website with the lava background and the waving text and look at the results. You will immediately see that Chrome is now the top CPU process, which is expected:

When you look at the network statistics as well, we have gone from idle to 7.1Mbps send. Since the website was loaded the network card has been working overtime, generating, as predicted, lots of data, as you can see from the chart.

This is one device. Now multiply that by, say, 5,550 devices, and you can also add people using RDP by other methods that are not laptops. If you take those 5,550 devices and apply 7.1Mbps to each one, you end up with a peak, worst-case figure of 39.405Gbps across all devices.
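That worst-case sum is easy to reproduce, and it is worth keeping as a small calculation so you can try other assumptions. In the sketch below the `active_fraction` parameter is an addition for illustration, not something measured in this investigation; it lets you model only a share of devices having the page on screen at once.

```python
def aggregate_gbps(devices: int, mbps_per_device: float, active_fraction: float = 1.0) -> float:
    """Total Gbps if a fraction of the devices each send at the same rate."""
    return devices * active_fraction * mbps_per_device / 1000


# Worst case from the article: every device sending 7.1Mbps at the same time.
print(f"worst case: {aggregate_gbps(5550, 7.1):.3f}Gbps")

# Hypothetical scenario: only one device in ten has the page visible.
print(f"10% active: {aggregate_gbps(5550, 7.1, active_fraction=0.1):.3f}Gbps")
```

Even the cut-down scenario lands above the 3.5Gb/s peak seen on the circuit, which shows how little of the estate needs to be affected to cause trouble.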

Check the DevTools

Well no, not for this example. If I load the website that was causing latency in ExpressRoute, this is what it looks like in DevTools:

That tells you that the website loaded 4 items and used 290 bytes of data. However, this does not tell you about the network card over-utilization you get from a website with motion on a VDI platform, because as far as Chrome is concerned the website has been loaded and its job is done.

However, if DevTools reports something like the below then you have some work to do on your website's requests. A website should not need 771 requests; there is something wrong with the design and components on the website.

Cause of ExpressRoute Overload

Getting back on track with the example at hand, this indicates that a code change on the Intranet caused additional refreshes and additional network traffic. In this case the "lava example" was the cause of the issue: someone thought it would be a good idea to have a moving background on the Intranet, and that caused the excess load on ExpressRoute.

Simply removing this moving background resolved the issue, and the ExpressRoute "peak" returned to a more normal baseline. The key here is to keep your main browser start page simple, efficient and, if you are using VDI resources, static. Advice is next.

Intranet Design : Advice

If you have a large company and the Intranet (or Extranet) will be the default homepage for all employees, then you need to follow these guidelines. Failing to follow them means a small change on an innocent website (which is what this article is all about) can cause a much larger problem upstream at other connectivity points.
  1. Responsive Design - Ensure the website is mobile-friendly and performs well on various devices.
  2. Optimized Images  - Compress images to reduce load times without compromising quality.
  3. Minimize HTTP Requests - Reduce the number of elements on a page to decrease load times.
  4. Caching - Implement browser and server-side caching to speed up load times.
  5. Optimize Code - Minify CSS, JavaScript, and HTML to reduce file sizes and improve loading speed.
  6. Efficient Database Queries - Optimize database queries and use indexing to enhance performance.
  7. Lazy Loading - Implement lazy loading for images and videos to improve initial load times.
  8. Server Optimization - Ensure the server is optimized for performance, using appropriate hardware and software configurations.
  9. Monitor Performance - Use tools like Google PageSpeed Insights, Cloudflare Observatory or Pingdom to regularly check and improve website performance.
  10. Consider your target devices - If your website is used from VDI or thin clients, think about the code and the load it places on those devices.
  11. Motion and Animation requires planning - If you are using a VDI based platform, or platforms with limited hardware resources, then think carefully before adding moving or animated objects.
  12. Optimise Requests for website - You should not require hundreds or thousands of requests to get your website to a loaded point. You can check this with F12 or DevTools, where it can be captured using the "Network" tab.
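For the last guideline, the DevTools "Network" tab can export a capture as a HAR file, which is JSON, so the request count can be checked outside the browser. The sketch below assumes the standard HAR layout (`log.entries`, each with a `response.bodySize`); the sample data and URLs are hand-written stand-ins, not a real capture, and the file name in the usage note is hypothetical.

```python
import json


def summarize_har(har: dict) -> tuple[int, int]:
    """Return (request count, total response body bytes) for a parsed HAR capture."""
    entries = har.get("log", {}).get("entries", [])
    total_body_bytes = 0
    for entry in entries:
        size = entry.get("response", {}).get("bodySize", -1)
        if size > 0:  # HAR uses -1 when the size is unknown
            total_body_bytes += size
    return len(entries), total_body_bytes


def load_har(path: str) -> dict:
    """Read a HAR file exported from the DevTools Network tab."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Tiny hand-written stand-in for a capture; all values are invented.
SAMPLE_HAR = {"log": {"entries": [
    {"request": {"url": "https://intranet.example/"}, "response": {"bodySize": 1200}},
    {"request": {"url": "https://intranet.example/app.css"}, "response": {"bodySize": 800}},
    {"request": {"url": "https://intranet.example/logo.png"}, "response": {"bodySize": -1}},
]}}

requests, body_bytes = summarize_har(SAMPLE_HAR)
print(f"{requests} requests, {body_bytes} bytes of response bodies")
```

With a real export you would call `summarize_har(load_har("intranet.har"))`. Bear in mind this only covers page load, so it will not reveal the screen redraw traffic that caused the issue in this article.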