I was shown a curious issue today, and am looking for feedback on how people
have handled similar situations.
The basic issue seems to be that as the data rate through a driver increases,
its DPC processing eventually consumes 100% of the CPU. This seems to
starve all PASSIVE_LEVEL threads of CPU time, which causes bad things
to happen. I’ve suggested that either the hardware needs some sort of
interrupt rate tuning capability (so the DPC can notice it’s hogging the
CPU and give it up for less than the 10-millisecond system time slice),
or else it needs enough buffers queued to the hardware that the DPC can stop
executing for a full time slice without overflowing the buffers. I suppose
another option, when loads get high, might be to stop processing at
DISPATCH_LEVEL and activate a PASSIVE_LEVEL thread to do the processing,
adjusting the thread priority to something appropriate (perhaps dynamically).
The thread scheduler would then essentially arbitrate between other threads
and the driver’s data processing.
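To make the options above a bit more concrete, here is a rough user-mode sketch in C. It is not driver code and uses no NDIS or WDM calls. The names rx_ring, drain_with_budget and buffers_for_pause are made up for illustration. The first function models a DPC that handles at most a fixed budget of packets per invocation and reports when work is left over, which is the point where a real driver might hand off to a PASSIVE_LEVEL thread. The second is just the arithmetic for how many buffers would have to be queued to ride out a pause of a given length at a given packet rate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a receive ring; not a real NDIS/WDM structure. */
typedef struct {
    size_t pending;    /* packets waiting to be processed */
    size_t processed;  /* packets handled so far */
} rx_ring;

typedef enum { DRAIN_DONE, DRAIN_DEFERRED } drain_result;

/* Handle at most `budget` packets, the way a DPC with a self-imposed
 * limit would. Returns DRAIN_DEFERRED when packets remain, meaning the
 * caller should pass the rest to lower-priority processing. */
static drain_result drain_with_budget(rx_ring *ring, size_t budget)
{
    assert(ring != NULL);
    size_t n = ring->pending < budget ? ring->pending : budget;
    ring->pending -= n;
    ring->processed += n;
    return ring->pending > 0 ? DRAIN_DEFERRED : DRAIN_DONE;
}

/* Buffers needed so the hardware can keep receiving while processing
 * pauses for pause_ms milliseconds at rate_pps packets per second.
 * Rounds up to a whole buffer. */
static unsigned long long buffers_for_pause(unsigned long long rate_pps,
                                            unsigned long long pause_ms)
{
    return (rate_pps * pause_ms + 999ULL) / 1000ULL;
}
```

As an example of the sizing arithmetic, an assumed rate of 100,000 packets per second and a 10 millisecond pause works out to 1000 buffers. The right budget per DPC is something that would have to be measured on the actual hardware.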
The device looks like an NDIS miniport on the top, but has WDM out the
bottom. Our testing group is able to generate packet rates high enough to
essentially flood the driver with data (the actual hardware data channel is
real fast). The testing group likes to install the packet bridge between two
of these and insert the whole path between groups of machines that generate
and consume packets, so there is no TCP flow control to limit things. I
suspect using the bridge causes the received packets to get transmitted
immediately, all at DISPATCH_LEVEL. I suggested to the test group that the
data rate would be limited if the machine were doing any real work, and that
using it as a two-port network switch is not exactly a normal operational
scenario. Their reply was that if they do the same test with more normal
network devices, they don’t consume 100% of the CPU at DPC level. It might
just be that our drivers are not yet as performance-tuned as the ones running on
millions of systems, and when they are, there will not be much difference in
behavior. Still, I’m actually happy to see our testing group find ways to
break things. I tend to feel if they can’t break it, they are just not
testing it deeply enough.
Any thoughts?
- Jan