Network packets Rx delayed by as much as 12 milliseconds in an NDIS protocol driver on Windows 10

Initially, I thought this is related to Interrupt Moderation / Coalescing on the network interfaces, but apparently that is not related.

This does correspond to what LatencyMon (from Resplendence) reports as Interrupt to Process latency (with peaks of ~12 milliseconds).

There’s no pattern to it that I could find. These bursts keeps happening 5 - 30 seconds apart, with no correlation to other CPU activity.

At the same time neither LatencyMon, nor xperf identifies any DPC or ISR from any driver as taking longer than 250 microseconds (i.e. peak, averages are much lower).

LatencyMon suggests disabling CPU throttling (which I have already done).

Not sure if it’s related, but in a user mode app, if I write a simple while loop with a 1 millisecond sleep, from time to time it also wakes up after 12 milliseconds (instead of 1 or 2) - once again corresponding to the peaks shown by LatencyMon.

I’ve tried removing or disabling most other devices (WiFi, BlueTooth, other USB devices, VirusScan, FireWalls, etc.), but none of it seems to make a difference).

It’s almost as if something is disabling interrupts for 12 milliseconds.

Any ideas on finding what might be causing this, or how to fix this would be greatly appreciated.

Thanks.
Devendra.

ps: The problem happens only on some machines.

dparakh wrote:

This does correspond to what LatencyMon (from Resplendence) reports as Interrupt to Process latency (with peaks of ~12 milliseconds).

Is this on one specific make/model of machine, or does it happen on many
machines?  Many years ago, I had a machine with a RealTek sound card,
where the RealTek audio driver would occasionally spin with interrupts
disabled for 50 milliseconds.  It played havoc with video capture, which
is what I was trying to do.

Not sure if it’s related, but in a user mode app, if I write a simple while loop with a 1 millisecond sleep, from time to time it also wakes up after 12 milliseconds (instead of 1 or 2) - once again corresponding to the peaks shown by LatencyMon.

How are you doing the sleep?  A user-mode sleep can’t usually get any
more precise than the scheduling interval (about 16ms), unless some
multimedia app has changed the interval using timeBeginTime.

Thanks Tim.

This happens on at least three PCs - they are not exactly same model, but they are all Dell Desktops, I think they’ve all got RealTek audio devices.

I tried disabling the RealTek device - but the problem persists.

The problem sounds very similar to what you describe, perhaps in some other driver.

I’m curious - how did you figure out it was the RealTek driver? Do you think Driver Verifier can be of help here?

For the user mode app, I did change the interval with timeBeginTime(1), but the timer interval was already .5 ms, as reported by SysInternals ClockRes64).

Thanks.
Devendra.

It’s almost as if something is disabling interrupts for 12 milliseconds.

That sounds suspiciously like https://en.wikipedia.org/wiki/System_Management_Mode

At the same time neither LatencyMon, nor xperf identifies any DPC or ISR from any driver as taking longer
than 250 microseconds (i.e. peak, averages are much lower).

I recall a thread around 12-13 years ago where a poster was complaining about some DRM protection scheme (I think it was called StarForce of something similar). It was taking advantage of the fact that no threads can run on a CPU until its DPC queue gets flushed. What it did was making a DPC re-queue itself to the head of the queue over and over again, effectively freezing the UP system for up to 5 seconds(!!!), while still formally complying with certification requirement of not exceeding 100us threshold upon any particular run.

Therefore, measuring maximum execution time of any particular DPC does not not really seem to be that indicative, does it. Just consider what happens if an interrupt occurs while DPC routine runs, and an ISR queues a DPC again, which is an absolutely normal scenario when the network traffic is heavy…

Not sure if it’s related, but in a user mode app, if I write a simple while loop with a 1 millisecond sleep, from time to time
it also wakes up after 12 milliseconds (instead of 1 or 2) - once again corresponding to the peaks shown by LatencyMon.

…and now the whole thing already looks more of a scheduler-related latency, taking into consideration that you are getting almost identical latencies in totally unrelated situations…

Please note that if a thread gets unblocked it does not necessarily imply that it will start running straight away.Everything depends on how its priority fares against the ones of the other threads that are eligible for execution. You can test this part pretty easily - just pin your target thread to some CPU, give it the highest priority, and see what happens when it gets unblocked. If you still get the same patterns of
DPC -to - process latency, then the problem very obviously lies elsewhere.

However, taking into consideration that you seem to be getting exactly the same latency of approx. 12ms, something tells me that the whole thing may well be scheduler-related, because I would expect much higher variety in “random” cases like heavy network traffic or SMI
( although I would not rule out the latter either)…

Anton Bassov

Thank You Jeffrey and Anton!!

I will investigate along these lines - will update here when/if I get this fixed!

Devendra.

So after investigating this more, I’ve finally zero-ed in on what triggers this behavior…

Everything works fine until some code increases the timer resolution to 1 millisecond (the default is 15 on the PC that I’m working on).

Once the resolution is increased, within a few seconds, LatencyMon starts reporting high interrupt to process latency (and the network packets also start getting delayed, and the sleeps start waking up late).

It does not matter whether the resolution is increased by user mode code (timeBeginPeriod) or kernel mode (ExSetTimerResolution).

And the problem goes away after the resolution is restored back to the default value.

It seems more and more like what @anton_bassov above refers to as relating to the scheduler.

Any suggestion on how to investigate this further or how to fix this?

Also, in addition to the Dell Desktops, we have also seen this on another type of PCs - MacBook Pros running Bootcamp.

Thanks.
Devendra.

As it turns out, I was using Windows RS4 (1803). Once I upgraded to RS5 (1809) - these problems have gone away.

Now I’m not sure if this is because Microsoft fixed something in RS5, or because the upgrade knocked out some driver which was causing a problem.

For now, things look good after the update. I’m not going to complain!

Thanks.
Devendra.