Driver hits DPC_TIMEOUT_TYPE: SINGLE_DPC_TIMEOUT_EXCEEDED

Hi Guys,
I am hitting a very interesting problem. Before I go into it, here is a bit about my driver.

  1. KMDF driver which uses IOCTLs to communicate with hardware.
  2. Uses Scatter Gather DMA.
  3. Does not throttle the application in any way. If the hardware has space, the driver will fire the command to the hardware. Essentially, the driver does not put artificial limits on what the application can do; in a way, the driver just provides a transport.
  4. The command flow looks like DeviceIoControl -> Driver -> FireCommand -> Interrupt -> DPC [complete commands, then pull more commands from the KMDF queue and fire them].
  5. So the DPC is doing two major things here. First, it completes the commands for which it got a completion from the hardware, and then it pulls more commands from the KMDF queues and fires them to keep the hardware busy (a rough sketch follows the bug-check output below).
  6. What we are seeing is that nothing bad happens up to about 100 threads. Once we increase the load by adding more threads, we hit:

BUGCHECK_CODE: 133
BUGCHECK_P1: 0
BUGCHECK_P2: 500
BUGCHECK_P3: 500
BUGCHECK_P4: fffff806b23c33a0
DPC_TIMEOUT_TYPE: SINGLE_DPC_TIMEOUT_EXCEEDED
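
Roughly, the DPC pattern from items 4 and 5 looks like this as a sketch (heavily simplified; DEVICE_CONTEXT, GetDeviceContext, and the Hw*/FireCommandToHardware helpers are placeholders, not our real code):

    VOID
    EvtInterruptDpc(
        _In_ WDFINTERRUPT Interrupt,
        _In_ WDFOBJECT AssociatedObject
        )
    {
        PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
        WDFREQUEST request;
        HW_COMPLETION entry;

        UNREFERENCED_PARAMETER(AssociatedObject);

        //
        // 1. Complete every command the hardware has finished.
        //
        while (HwGetNextCompletion(devCtx, &entry)) {
            WdfRequestCompleteWithInformation(entry.Request,
                                              entry.Status,
                                              entry.BytesTransferred);
        }

        //
        // 2. Keep the hardware busy: pull more requests from the KMDF
        //    queue and fire them as long as the hardware has space.
        //
        while (HwHasFreeSlot(devCtx) &&
               NT_SUCCESS(WdfIoQueueRetrieveNextRequest(devCtx->IoctlQueue,
                                                        &request))) {
            FireCommandToHardware(devCtx, request);
        }
    }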

I can think of two ways to reduce the time spent in DPC.

  1. Take command firing out of the DPC and use a passive-level work item to fire the next set of commands (see the sketch after this list). I know this will save a lot of DPC cycles.
  2. Somehow put a limit on how many commands can be fired at any given time.
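
A rough sketch of what option 1 could look like, assuming a WDFWORKITEM created at device init with WdfWorkItemCreate (parented to the WDFDEVICE) and stored in the device context; the names are placeholders:

    VOID
    EvtInterruptDpc(
        _In_ WDFINTERRUPT Interrupt,
        _In_ WDFOBJECT AssociatedObject
        )
    {
        PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));

        UNREFERENCED_PARAMETER(AssociatedObject);

        // Only complete finished commands here; keep the DPC short.
        HwCompleteFinishedCommands(devCtx);

        // Defer command firing to PASSIVE_LEVEL.
        WdfWorkItemEnqueue(devCtx->FireCommandsWorkItem);
    }

    VOID
    EvtFireCommandsWorkItem(
        _In_ WDFWORKITEM WorkItem
        )
    {
        // The work item is parented to the WDFDEVICE, so the parent
        // object gives us back the device context.
        PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfWorkItemGetParentObject(WorkItem));
        WDFREQUEST request;

        while (HwHasFreeSlot(devCtx) &&
               NT_SUCCESS(WdfIoQueueRetrieveNextRequest(devCtx->IoctlQueue,
                                                        &request))) {
            FireCommandToHardware(devCtx, request);
        }
    }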

Any thoughts are highly appreciated!
-Aj

How are multiple threads being introduced here? Usually, multiple simultaneous I/Os are just done using overlapped I/O in a single thread or a few threads.

It’s using overlapped I/O! I am not sure how many threads they are using to do this. I think 128 threads are being deployed for this test.

Hi All,

Just wanted to give everyone an update on this issue. I debugged it and instrumented the driver's ISR and DPC. It turns out that the problem was an interlocked list.

My driver used an interlocked list between the ISR and the DPC to communicate which interrupt vector needs processing. As the number of threads increased, the number of interrupts increased, and the amount of time the driver spent contending for this interlocked list increased as well. Once I removed the list, everything looks great.

The idea behind the list was to make sure that we only process the interrupts that actually need servicing and nothing else. It turns out that it is more efficient to just read a bunch of states than to keep contending for a lock.
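
For anyone hitting something similar, the removed pattern looked roughly like this (a simplified sketch with placeholder names; re-queue bookkeeping is omitted):

    BOOLEAN
    EvtInterruptIsr(
        _In_ WDFINTERRUPT Interrupt,
        _In_ ULONG MessageID
        )
    {
        PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
        PVECTOR_WORK work = &devCtx->VectorWork[MessageID];  // preallocated per vector

        // Every interrupt takes VectorListLock...
        ExInterlockedInsertTailList(&devCtx->VectorList,
                                    &work->ListEntry,
                                    &devCtx->VectorListLock);

        WdfInterruptQueueDpcForIsr(Interrupt);
        return TRUE;
    }

    VOID
    EvtInterruptDpc(
        _In_ WDFINTERRUPT Interrupt,
        _In_ WDFOBJECT AssociatedObject
        )
    {
        PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
        PLIST_ENTRY entry;

        UNREFERENCED_PARAMETER(AssociatedObject);

        // ...and so does every pass through this drain loop. At a high
        // interrupt rate the ISR is grabbing the lock almost constantly,
        // and the time spent spinning here is charged to the DPC.
        while ((entry = ExInterlockedRemoveHeadList(&devCtx->VectorList,
                                                    &devCtx->VectorListLock)) != NULL) {
            ProcessVector(devCtx, CONTAINING_RECORD(entry, VECTOR_WORK, ListEntry));
        }
    }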

Hope this finding saves someone some time.

Thanks,
Aj


Great observation, actually.

The point here being that the time you spend spinning on the lock at IRQL DISPATCH_LEVEL is going to be counted against you by the watchdog timer.

Now... having said that... I am pretty surprised to hear that you're spending considerable time contending on ExInterlockedRemoveXxxxList. That's a pretty special, and a highly optimized, function. So... hmmm...

Hi Peter,

I think I did not do a good job at explaining this properly. This problem is not because of the ExInterlockedXXX functions being inefficient.

I see that the driver gets into a situation where the DPC is running and the hardware continuously generates interrupts. The ISR keeps inserting into the circular queue, so the queue never drains, and the DPC gets preempted by the ISR on every interrupt. The cumulative time spent contending on the lock becomes a big part of the DPC's run time, not because the interlocked insert/remove is inefficient, but because the interrupts arrive so fast that the ISR is almost always grabbing that lock.

I should have had a heuristic to return from the DPC after processing some number of completions, but that model comes down to what I am doing now. Every time there is an interrupt, I schedule the DPC; in the DPC I check my hardware and process all the completions that are pending in the queue. After that is done, I use a worker thread to fire the next set of commands. There is no lock anymore.
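
In sketch form, the new model is roughly this (placeholder names again; HwVectorHasCompletions and HwProcessCompletions stand in for whatever status registers the hardware exposes):

    BOOLEAN
    EvtInterruptIsr(
        _In_ WDFINTERRUPT Interrupt,
        _In_ ULONG MessageID
        )
    {
        UNREFERENCED_PARAMETER(MessageID);

        // No shared list, no lock: just ask for the DPC.
        WdfInterruptQueueDpcForIsr(Interrupt);
        return TRUE;
    }

    VOID
    EvtInterruptDpc(
        _In_ WDFINTERRUPT Interrupt,
        _In_ WDFOBJECT AssociatedObject
        )
    {
        PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
        ULONG vector;

        UNREFERENCED_PARAMETER(AssociatedObject);

        // Reading per-vector completion state is cheap compared to
        // contending on a lock the ISR grabs on every interrupt.
        for (vector = 0; vector < devCtx->VectorCount; vector++) {
            if (HwVectorHasCompletions(devCtx, vector)) {
                HwProcessCompletions(devCtx, vector);   // completes the WDFREQUESTs
            }
        }

        // Fire the next batch of commands at PASSIVE_LEVEL via the
        // worker (same work-item pattern as earlier in the thread).
        WdfWorkItemEnqueue(devCtx->FireCommandsWorkItem);
    }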

Thanks,
Aj

Thanks for the excellent reply and clarification.

Just as an aside... As both Mr. @Tim_Roberts and I have observed previously, at some point, it is far more efficient to not use interrupts at all and instead poll for completed transfers. Seriously.

I know this goes against our experience, and perhaps even what we were taught in university, but as throughput rates (messages per second) get higher, polling becomes more efficient. The driver for a super-high throughput FPGA that I worked on last year shut off interrupts entirely.


And if your data rate has bursts or surges and goes up and down a lot, consider a hybrid of interrupts and polling: interrupts for when the data rate is low, and polling for when it is high. Careful transitions are important, of course.
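
Something along these lines, as a sketch only (the watermarks, the Hw* helpers, and the poll timer are all assumptions about your hardware):

    #define HIGH_WATERMARK  64      // made-up thresholds; tune for the device
    #define LOW_WATERMARK    4

    //
    // Called from the DPC while in interrupt mode, or from a periodic
    // poll (e.g. a WDF timer) while in polling mode.
    //
    VOID
    DrainAndMaybeSwitchMode(
        _In_ PDEVICE_CONTEXT devCtx
        )
    {
        ULONG completed = HwDrainCompletions(devCtx);

        if (devCtx->PollingMode) {
            if (completed < LOW_WATERMARK) {
                //
                // Traffic dropped off: go back to interrupts. Unmask first,
                // then drain once more so a completion that landed between
                // the last poll and the unmask is not stranded.
                //
                devCtx->PollingMode = FALSE;
                HwEnableCompletionInterrupt(devCtx);
                HwDrainCompletions(devCtx);
            }
        } else if (completed > HIGH_WATERMARK) {
            // Burst detected: mask the interrupt and let the poll timer
            // drive completions until the rate falls again.
            HwDisableCompletionInterrupt(devCtx);
            devCtx->PollingMode = TRUE;
        }
    }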


That is something I would love to try.
Thanks, guys, for all the suggestions.