Limiting DPC time to prevent PASSIVE_LEVEL starvation

I was shown a curious issue today, and am looking for feedback on how people
have handled similar situations.

The basic issue seems to be that as the data rate on a driver increases, it
eventually consumes 100% of the CPU in its DPC processing. This seems to
cause CPU starvation for all passive_level threads, which causes bad things
to happen. I’ve suggested that either the hardware needs some sort of
interrupt rate tuning capability (so the DPC can notice it’s hogging the
CPU, and give it up for less than the system time slice <10 milliseconds>),
or else it needs enough buffers queued to the hardware that the DPC can stop
executing for a full time slice without overflowing the buffers. I suppose
another option might be, when loads get high, to stop processing at
dispatch_level and activate a passive_level thread to do the processing,
adjusting the thread priority to something appropriate (perhaps dynamically).
The thread scheduler would then essentially arbitrate between other threads
and data rates.
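
To make that third option concrete, here is roughly what I have in mind, as a
minimal sketch only (RX_WORKER, DrainReceiveRing(), and the priority value are
made-up placeholders, not our actual driver code): the DPC signals a dedicated
PASSIVE_LEVEL system thread once it decides it has done enough at
DISPATCH_LEVEL, and that thread drains the hardware at a tunable priority.

    #include <ntddk.h>

    VOID DrainReceiveRing(VOID);   // hypothetical: empties the hardware ring

    typedef struct _RX_WORKER {
        KEVENT  WakeEvent;   // signaled by the DPC when receive work is pending
        BOOLEAN Exit;        // set at halt/unload time to stop the thread
    } RX_WORKER, *PRX_WORKER;

    // Runs at PASSIVE_LEVEL, so the scheduler (not the DPC) decides how much
    // CPU packet processing gets relative to everything else on the system.
    VOID RxWorkerThread(PVOID Context)
    {
        PRX_WORKER Worker = (PRX_WORKER)Context;

        // Priority could be adjusted dynamically based on observed load.
        KeSetPriorityThread(KeGetCurrentThread(), LOW_REALTIME_PRIORITY - 1);

        for (;;) {
            KeWaitForSingleObject(&Worker->WakeEvent, Executive,
                                  KernelMode, FALSE, NULL);
            if (Worker->Exit) {
                break;
            }
            DrainReceiveRing();
        }
        PsTerminateSystemThread(STATUS_SUCCESS);
    }

    // At driver start: KeInitializeEvent(&Worker->WakeEvent,
    // SynchronizationEvent, FALSE) and PsCreateSystemThread(&threadHandle,
    // THREAD_ALL_ACCESS, NULL, NULL, NULL, RxWorkerThread, Worker).
    // From the DPC, when the load gets high:
    //     KeSetEvent(&Worker->WakeEvent, IO_NO_INCREMENT, FALSE);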

The device looks like an NDIS miniport on the top, but has WDM out the
bottom. Our testing group is able to generate packet rates high enough to
essentially flood the driver with data (the actual hardware data channel is
real fast). The testing group likes to install the packet bridge between two
of these and insert the whole path between groups of machines that generate
and consume packets, so there is no TCP flow control to limit things. I
suspect using the bridge causes the received packets to immediately get
transmitted, all at dispatch_level. I suggested to the test group that the
data rate would get limited if the machine was doing any real work, and
using it as a 2 port network switch is not exactly a normal operational
scenario. Their reply was that if they do the same test with more normal
network devices, they don’t consume 100% of the cpu at DPC level. It might
just be that our drivers are not yet as performance-tuned as the ones running on
millions of systems, and when they are, there will not be much difference in
behavior. Still, I’m actually happy to see our testing group find ways to
break things. I tend to feel if they can’t break it, they are just not
testing it deeply enough.

Any thoughts?

- Jan

A couple of comments:

Right. As we all know, running at IRQL DISPATCH_LEVEL prevents thread scheduling. Back in the day, before we had CPUs as fast as we have now, the ability to exhaust all the CPU time on the system doing packet processing in your network driver’s DpcForIsr was pretty common.

The fix I’ve used is to limit the amount of time your driver spends in its DPC, typically by limiting the number of consecutive “loops” (checking for received packets complete, transmitted packets complete, and device service requests) that you let your driver perform. After some number of “loops”, have your DpcForIsr queue a timer callback that re-runs your DPC processing after enough time has elapsed to allow the rest of the system to make forward progress.
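
In outline, something like the following sketch (the ADAPTER structure and the
ServiceReceives/ServiceTransmits/MoreWorkPending/EnableDeviceInterrupts helpers
are hypothetical names standing in for whatever your driver actually does; the
timer and its DPC are assumed to be initialized at start-up with
KeInitializeTimer and KeInitializeDpc pointing back at this routine):

    #include <wdm.h>

    #define MAX_LOOPS 16   // how many passes before yielding; tune empirically

    typedef struct _ADAPTER {
        KTIMER ResumeTimer;   // KeInitializeTimer() at start-up
        KDPC   ResumeDpc;     // KeInitializeDpc(..., AdapterDpc, Adapter)
        /* ... hardware state ... */
    } ADAPTER, *PADAPTER;

    // Hypothetical hardware-servicing helpers, defined elsewhere in the driver.
    VOID    ServiceReceives(PADAPTER Adapter);
    VOID    ServiceTransmits(PADAPTER Adapter);
    BOOLEAN MoreWorkPending(PADAPTER Adapter);
    VOID    EnableDeviceInterrupts(PADAPTER Adapter);

    VOID AdapterDpc(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        PADAPTER      Adapter = (PADAPTER)Context;
        LARGE_INTEGER dueTime;
        ULONG         i;

        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);

        for (i = 0; i < MAX_LOOPS; i++) {
            ServiceReceives(Adapter);    // completed receive descriptors
            ServiceTransmits(Adapter);   // completed transmit descriptors
            if (!MoreWorkPending(Adapter)) {
                EnableDeviceInterrupts(Adapter);   // caught up: re-arm and leave
                return;
            }
        }

        // Still busy after MAX_LOOPS passes: leave device interrupts disabled
        // and let a timer DPC resume this routine a little later, so that
        // PASSIVE_LEVEL threads get a chance to run in the meantime.
        dueTime.QuadPart = -2 * 10 * 1000;   // 2 ms, relative, in 100 ns units
        KeSetTimer(&Adapter->ResumeTimer, dueTime, &Adapter->ResumeDpc);
    }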

It’s relatively easy to contrive situations that’ll break a network driver. Or any driver for that matter. These situations might be fun for stress testing – just to see what your driver will do under far edge conditions – but if the situations you cook up can’t occur in the wild, I’d suggest that they’re not something to worry too much about.

I’m not sure I completely understand the network topology you’re describing, but pumping packets into a system at IRQL DISPATCH_LEVEL and receiving them back on the same system (and processing them at IRQL DISPATCH_LEVEL) doesn’t seem to be the kind of thing a customer would see in “real life.”

If they can’t consume 100% of the CPU at IRQL DISPATCH_LEVEL in a real-world test, then that’s a GOOD thing. Except, as I previously noted, for stress/edge-condition testing (which is really not about the real world, but about the behavior of your driver in extreme circumstances, to ensure that it fails predictably and gracefully), over-burdening the system at IRQL DISPATCH_LEVEL and expecting your driver to work AND the system to make progress is no more reasonable than firing up your driver, beating the computer with a sledge hammer, and expecting your driver to continue to receive packets as the box and mainboard are shattered to bits. Right… that won’t work. OK, very nice… but it really doesn’t tell us anything about how the driver will work or fail when any customer actually uses it.

Peter
OSR

Peter’s reply hit most of the interesting points; I would only add
that it might make sense to think about batching requests as much as
possible, so you can limit the number of “transactions” your driver
has to perform. In super-high-bandwidth situations, you may have to
accept greater delay (and delay variance), in the form of queuing
delay, in order to keep moving bits. Think of it as analogous to
trying to limit context switches.

Then again, I don’t know anything about your hardware, either; you
may have to profile it to see how it performs. But if you can perhaps
buffer more data before you actually run your DPC, and therefore run
your DPC fewer times, you might be ahead.
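
To illustrate the batching idea, here is a sketch assuming an NDIS 5.x-style
miniport receive path (RX_BATCH_SIZE, the adapter context with its
MiniportHandle field, and GetNextCompletedPacket() are made-up names): collect
whatever the hardware has completed and indicate it up the stack in one call,
rather than one call per packet.

    #define RX_BATCH_SIZE 32   // hypothetical per-DPC batch limit

    // Hypothetical: returns the next hardware-completed packet, or NULL.
    PNDIS_PACKET GetNextCompletedPacket(PADAPTER Adapter);

    VOID IndicateReceivesBatched(PADAPTER Adapter)
    {
        PNDIS_PACKET batch[RX_BATCH_SIZE];
        PNDIS_PACKET packet;
        UINT         count = 0;

        // Gather everything the hardware has completed, up to the batch limit.
        while (count < RX_BATCH_SIZE &&
               (packet = GetNextCompletedPacket(Adapter)) != NULL) {
            batch[count++] = packet;
        }

        if (count > 0) {
            // One indication covers the whole batch, amortizing per-call
            // overhead across many packets at the cost of a little latency.
            NdisMIndicateReceivePacket(Adapter->MiniportHandle, batch, count);
        }
    }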

The other question I would have is packet size - it’s relatively easy
to kill network performance by flooding a link with a bunch of small
packets. If you’re curious about tuning to real Internet traffic,
there are lots of reports about traffic distributions in the
literature. There tends to be a bi- or tri-modal distribution, with
lots of 40-60 byte packets and lots of 1460-1500 byte packets, with a
third peak somewhere in the middle.

You should probably set performance goals for yourself in terms of
bps and pps, and maybe in terms of per-packet delay and delay
variance, if real-time data is important to you, and then test to them.

Oh, another thing - you’re ignoring the application end of things.
Unless this is designed for use only in routers, you’re eventually
going to have to get data to usermode and back (well, or a TDI
client). The performance bottleneck can easily wind up there, and the
flow control that results from e.g. TCP based on the app’s ability to
handle data can turn out to be the limiting factor. If you have a
particular app in mind, you should definitely be testing with it as
well. A great many apps use Berkeley-style socket I/O, which makes it
hard to get optimal network performance out of Windows.

Anyway, good luck.

-Steve
