Hi Guys,
I am hitting a very interesting problem. Before I go into the problem, here is a bit about my driver.
KMDF driver which uses IOCTLs to communicate with hardware.
Uses Scatter Gather DMA.
Does not throttle the application in any way. If the hardware has space, the driver will fire the command to the hardware. Essentially the driver does not put artificial limits on what the application can do. In a way, the driver just provides a transport.
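Roughly, the IOCTL-to-DMA path looks like the sketch below. This is not my actual code, just the general shape: the device context layout, the transfer direction, and HwProgramSgList are placeholders.

```c
#include <ntddk.h>
#include <wdf.h>

typedef struct _DEVICE_CONTEXT {
    WDFDMAENABLER DmaEnabler;    /* created at prepare-hardware time (not shown) */
} DEVICE_CONTEXT, *PDEVICE_CONTEXT;
WDF_DECLARE_CONTEXT_TYPE_WITH_NAME(DEVICE_CONTEXT, GetDeviceContext);

EVT_WDF_PROGRAM_DMA EvtProgramDma;
VOID HwProgramSgList(PDEVICE_CONTEXT DevCtx, PSCATTER_GATHER_LIST SgList); /* placeholder */

/* One DMA transaction per IOCTL; no throttling -- if create/execute fails,
   the request is simply completed with the error. */
VOID EvtIoDeviceControl(WDFQUEUE Queue, WDFREQUEST Request,
                        size_t OutputBufferLength, size_t InputBufferLength,
                        ULONG IoControlCode)
{
    PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfIoQueueGetDevice(Queue));
    WDFDMATRANSACTION transaction;
    NTSTATUS status;

    UNREFERENCED_PARAMETER(OutputBufferLength);
    UNREFERENCED_PARAMETER(InputBufferLength);
    UNREFERENCED_PARAMETER(IoControlCode);

    status = WdfDmaTransactionCreate(devCtx->DmaEnabler,
                                     WDF_NO_OBJECT_ATTRIBUTES, &transaction);
    if (!NT_SUCCESS(status)) {
        WdfRequestComplete(Request, status);
        return;
    }

    status = WdfDmaTransactionInitializeUsingRequest(
                 transaction, Request, EvtProgramDma,
                 WdfDmaDirectionWriteToDevice);   /* direction is illustrative */
    if (NT_SUCCESS(status)) {
        status = WdfDmaTransactionExecute(transaction, WDF_NO_CONTEXT);
    }
    if (!NT_SUCCESS(status)) {
        WdfObjectDelete(transaction);
        WdfRequestComplete(Request, status);
    }
}

/* The framework calls this once the scatter/gather list has been mapped. */
BOOLEAN EvtProgramDma(WDFDMATRANSACTION Transaction, WDFDEVICE Device,
                      WDFCONTEXT Context, WDF_DMA_DIRECTION Direction,
                      PSCATTER_GATHER_LIST SgList)
{
    UNREFERENCED_PARAMETER(Transaction);
    UNREFERENCED_PARAMETER(Context);
    UNREFERENCED_PARAMETER(Direction);

    HwProgramSgList(GetDeviceContext(Device), SgList);   /* fire the command */
    return TRUE;
}
```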
The command flow looks like DeviceIOControl->Driver->FireCommand->Interrupt->DPC [Complete Commands and Pull Commands From the KMDF Queue and Fire Them].
So the DPC is doing two major things here. First it is completing the commands for which it got completions from the hardware, and then it is pulling more commands from the KMDF queues and firing them to the hardware to keep the hardware busy.
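In rough pseudo-KMDF terms the DPC looks something like this (a simplified sketch, not the actual code; the HwXxx helpers and the context fields are placeholders):

```c
/* Simplified sketch of the DPC: first retire everything the hardware has
   finished, then keep the hardware fed from the framework queue. */
VOID EvtInterruptDpc(WDFINTERRUPT Interrupt, WDFOBJECT AssociatedObject)
{
    PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
    WDFREQUEST request;
    NTSTATUS status;

    UNREFERENCED_PARAMETER(AssociatedObject);

    /* Job 1: complete every command the hardware has finished. */
    while (HwGetNextCompletion(devCtx, &request, &status)) {
        WdfRequestComplete(request, status);
    }

    /* Job 2: keep the hardware busy -- pull more commands from the KMDF
       queue and fire them for as long as there are free slots. */
    while (HwHasFreeSlot(devCtx)) {
        status = WdfIoQueueRetrieveNextRequest(devCtx->IoctlQueue, &request);
        if (!NT_SUCCESS(status)) {
            break;   /* queue is empty */
        }
        HwFireCommand(devCtx, request);
    }
}
```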
What we are seeing is that up to about 100 threads, nothing bad happens. After that, when we start increasing the load by increasing the number of threads, we hit DPC watchdog timeouts.
How are multiple threads being introduced here? Usually, multiple simultaneous I/Os are just done using overlapped I/O in a single thread or a few threads.
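For example, a single thread can keep many IOCTLs in flight at once with overlapped I/O, roughly like this (the device name and IOCTL code are made up for illustration):

```c
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

/* Made-up IOCTL and device name, purely for illustration. */
#define IOCTL_MYDEV_DO_TRANSFER CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
#define PENDING_IO_COUNT 8

int main(void)
{
    HANDLE device = CreateFileW(L"\\\\.\\MyDevice", GENERIC_READ | GENERIC_WRITE,
                                0, NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (device == INVALID_HANDLE_VALUE) {
        return 1;
    }

    OVERLAPPED ov[PENDING_IO_COUNT] = {0};
    BYTE buffers[PENDING_IO_COUNT][4096];

    /* Issue all the IOCTLs without waiting for any of them. */
    for (int i = 0; i < PENDING_IO_COUNT; i++) {
        ov[i].hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);
        if (!DeviceIoControl(device, IOCTL_MYDEV_DO_TRANSFER,
                             buffers[i], sizeof(buffers[i]),
                             buffers[i], sizeof(buffers[i]),
                             NULL, &ov[i]) &&
            GetLastError() != ERROR_IO_PENDING) {
            printf("submit %d failed: %lu\n", i, GetLastError());
        }
    }

    /* Reap the completions, still on this one thread. */
    for (int i = 0; i < PENDING_IO_COUNT; i++) {
        DWORD bytes;
        if (GetOverlappedResult(device, &ov[i], &bytes, TRUE)) {
            printf("I/O %d completed, %lu bytes\n", i, bytes);
        }
        CloseHandle(ov[i].hEvent);
    }

    CloseHandle(device);
    return 0;
}
```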
Just wanted to give everyone an update on this issue. I debugged it and instrumented the driver ISR and DPC. It turns out that the problem was an interlocked list.
My driver uses an interlocked list between the ISR and the DPC to communicate which interrupt vector needs processing. As the number of threads increases, the number of interrupts increases, and the amount of time the driver spends contending for this interlocked list increases as well. Once I removed this list, everything looks great.
The idea behind this list was to make sure that we only process the interrupts that actually need servicing and nothing else. It turns out that it is more efficient to just read a bunch of hardware state than to keep contending for a lock.
The point here being that the time you spend spinning on the lock at DISPATCH_LEVEL is going to be counted against you by the DPC watchdog timer.
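To illustrate, the pattern that got me into trouble was roughly this (heavily simplified; the entry layout, the VectorItems array, and HwServiceVector are placeholders):

```c
/* Sketch of the old design: the ISR pushes "this vector needs service"
   entries onto a shared list; the DPC pops them. Both sides go through the
   same spin lock, which is where the time went under load. */
typedef struct _VECTOR_WORK_ITEM {
    LIST_ENTRY    ListEntry;
    ULONG         MessageId;   /* which MSI-X vector needs processing */
    volatile LONG Queued;      /* guards against double-insertion */
} VECTOR_WORK_ITEM, *PVECTOR_WORK_ITEM;

BOOLEAN EvtInterruptIsr(WDFINTERRUPT Interrupt, ULONG MessageID)
{
    PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
    PVECTOR_WORK_ITEM item = &devCtx->VectorItems[MessageID];  /* pre-allocated */

    if (InterlockedExchange(&item->Queued, 1) == 0) {
        ExInterlockedInsertTailList(&devCtx->PendingVectors,
                                    &item->ListEntry,
                                    &devCtx->PendingVectorsLock);
    }
    WdfInterruptQueueDpcForIsr(Interrupt);
    return TRUE;   /* interrupt claim/ack logic omitted */
}

VOID EvtInterruptDpc(WDFINTERRUPT Interrupt, WDFOBJECT AssociatedObject)
{
    PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
    PLIST_ENTRY entry;

    UNREFERENCED_PARAMETER(AssociatedObject);

    /* Every pop takes the same lock the ISR takes on every interrupt, so at
       high interrupt rates this loop is mostly contention, not work. */
    while ((entry = ExInterlockedRemoveHeadList(&devCtx->PendingVectors,
                                                &devCtx->PendingVectorsLock)) != NULL) {
        PVECTOR_WORK_ITEM item = CONTAINING_RECORD(entry, VECTOR_WORK_ITEM, ListEntry);
        InterlockedExchange(&item->Queued, 0);
        HwServiceVector(devCtx, item->MessageId);
    }
}
```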
Now... having said that... I am pretty surprised to hear that you're spending considerable time contending on ExInterlockedRemoveXxxxList. That's a pretty special, and a highly optimized, function. So... hmmm...
I think I did not do a good job of explaining this. The problem is not that the ExInterlockedXXX functions are inefficient.
I see that the driver gets into a situation where the DPC is running and the hardware continuously generates interrupts. The ISR keeps inserting into the circular queue, so the queue never drains, and the DPC gets preempted by the ISR every time there is an interrupt. The cumulative time spent contending for this lock becomes a big part of the DPC's run time [not because the interlocked insert/remove is inefficient], simply because the interrupts are coming so fast that the ISR is almost always grabbing that lock.
I should have had a heuristic to return from the DPC after processing some number of completions. But that model comes down to what I am doing now: every time there is an interrupt, I schedule the DPC, and in the DPC I go and check my hardware and process all the completions pending in the queue. After that is done, I use a worker thread to fire the next set of commands. There is no lock anymore.
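In sketch form, the reworked path is roughly this (again simplified; the HwXxx helpers and context fields are placeholders, and it assumes the work item's parent object is the WDFDEVICE):

```c
/* The DPC only reads hardware state and completes finished requests;
   refilling the hardware happens in a work item at PASSIVE_LEVEL.
   No shared ISR/DPC list, no lock. */
VOID EvtInterruptDpc(WDFINTERRUPT Interrupt, WDFOBJECT AssociatedObject)
{
    PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
    WDFREQUEST request;
    NTSTATUS status;

    UNREFERENCED_PARAMETER(AssociatedObject);

    /* Walk the hardware completion state directly instead of a software list. */
    while (HwGetNextCompletion(devCtx, &request, &status)) {
        WdfRequestComplete(request, status);
    }

    /* Fire the next batch from a worker thread, not from the DPC. */
    WdfWorkItemEnqueue(devCtx->SubmitWorkItem);
}

VOID EvtSubmitWorkItem(WDFWORKITEM WorkItem)
{
    PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfWorkItemGetParentObject(WorkItem));
    WDFREQUEST request;

    while (HwHasFreeSlot(devCtx) &&
           NT_SUCCESS(WdfIoQueueRetrieveNextRequest(devCtx->IoctlQueue, &request))) {
        HwFireCommand(devCtx, request);
    }
}
```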
Just as an aside... As both Mr. @Tim_Roberts and I have observed previously, at some point, it is far more efficient to not use interrupts at all and instead poll for completed transfers. Seriously.
I know this goes against our experience, and perhaps even what we were taught in university, but as throughput rates (messages per second) get higher, polling becomes more efficient. The driver for a super-high throughput FPGA that I worked on last year shut off interrupts entirely.
And if your data rate has bursts or surges and goes up and down a lot, use a hybrid of interrupts and polling: interrupts for when the data rate is low, and polling for when it is high. Careful transitions are important, of course.
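A very rough sketch of such a hybrid, using a periodic one-shot WDF timer as the polling engine (the threshold, the 1 ms period, and the HwXxx register helpers are all assumptions, not a recipe):

```c
/* When a DPC pass retires a lot of completions, mask the device interrupt and
   let a timer poll; when a poll pass finds nothing, drop back to interrupts. */
#define POLL_ENTER_THRESHOLD 64   /* completions per DPC that trigger polling */

static VOID ServiceCompletions(PDEVICE_CONTEXT DevCtx, ULONG *Completed)
{
    WDFREQUEST request;
    NTSTATUS status;
    ULONG count = 0;

    while (HwGetNextCompletion(DevCtx, &request, &status)) {
        WdfRequestComplete(request, status);
        count++;
    }
    *Completed = count;
}

VOID EvtInterruptDpc(WDFINTERRUPT Interrupt, WDFOBJECT AssociatedObject)
{
    PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
    ULONG completed;

    UNREFERENCED_PARAMETER(AssociatedObject);
    ServiceCompletions(devCtx, &completed);

    if (completed >= POLL_ENTER_THRESHOLD) {
        HwMaskInterrupt(devCtx);                        /* device register write */
        WdfTimerStart(devCtx->PollTimer, WDF_REL_TIMEOUT_IN_MS(1));
    }
}

VOID EvtPollTimer(WDFTIMER Timer)
{
    PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfTimerGetParentObject(Timer));
    ULONG completed;

    ServiceCompletions(devCtx, &completed);

    if (completed != 0) {
        WdfTimerStart(Timer, WDF_REL_TIMEOUT_IN_MS(1)); /* still busy: keep polling */
    } else {
        /* Idle again: re-arm the interrupt, then re-check the hardware so a
           completion that raced in during the switch is not missed. */
        HwUnmaskInterrupt(devCtx);
        ServiceCompletions(devCtx, &completed);
    }
}
```

At really high rates a dedicated polling thread tends to beat a timer, but the shape of the transition logic stays the same.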