I’m new to Windows driver development and I’m currently trying to track down a bug in an existing driver that I have to resolve.
Here’s the basic design:
We have a custom PCIe HW card with a DMA core on it, which is responsible for streaming data into the PC’s memory over the PCIe bus.
As soon as a chunk of new data is copied by the external DMA core, an MSI interrupt is generated and the driver’s MSI ISR wakes up.
The MSI ISR only reads an internal register of our card to distinguish between the nominal data-ready interrupt and a few error interrupts,
and if the interrupt is a data-ready one, it calls WdfDpcEnqueue to defer the rest of the work to a DPC.
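In KMDF terms, the ISR side is shaped roughly like this (the register offsets, bit definitions, and context layout below are illustrative placeholders, not our real code):

```c
#include <ntddk.h>
#include <wdf.h>

#define REG_IRQ_STATUS      0x10   // hypothetical interrupt-status register offset
#define IRQ_STATUS_DATA_RDY 0x01   // hypothetical "data ready" bit
#define IRQ_STATUS_ERROR    0x02   // hypothetical "error" bit

typedef struct _DEVICE_CONTEXT {
    PUCHAR RegisterBase;           // BAR mapped in EvtDevicePrepareHardware
    WDFDPC DataDpc;                // created with WdfDpcCreate at device init
} DEVICE_CONTEXT, *PDEVICE_CONTEXT;

// Assumes this context type was assigned to the WDFDEVICE at creation time.
WDF_DECLARE_CONTEXT_TYPE_WITH_NAME(DEVICE_CONTEXT, GetDeviceContext);

BOOLEAN
EvtInterruptIsr(
    _In_ WDFINTERRUPT Interrupt,
    _In_ ULONG        MessageID
    )
{
    UNREFERENCED_PARAMETER(MessageID);

    PDEVICE_CONTEXT ctx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));

    // Read (and typically acknowledge) the card's interrupt status register.
    ULONG status = READ_REGISTER_ULONG(
        (PULONG)(ctx->RegisterBase + REG_IRQ_STATUS));

    if (status & IRQ_STATUS_DATA_RDY) {
        // Defer the heavy work.  WdfDpcEnqueue returns FALSE if this DPC
        // object is already queued and has not started yet, i.e. two
        // interrupts can collapse into a single DPC run.
        WdfDpcEnqueue(ctx->DataDpc);
    } else if (status & IRQ_STATUS_ERROR) {
        // record / handle the error condition
    }

    return TRUE;   // we claimed the interrupt (MSI is never shared)
}
```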
The DPC, in turn, is responsible for copying the data from the memory buffer filled by the DMA core into a larger cyclic buffer, so that the procedure can repeat without overwriting data that hasn’t been consumed yet.
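And the DPC side, continuing the same sketch (the chunk size, ring size, ack register, and extra context fields are again placeholders):

```c
#define REG_IRQ_ACK   0x14                 // hypothetical "re-arm/ack" register
#define CHUNK_SIZE    (64 * 1024)          // illustrative DMA chunk size
#define RING_SIZE     (256 * CHUNK_SIZE)   // cyclic buffer holds many chunks

typedef struct _RING_BUFFER {
    UCHAR  Data[RING_SIZE];
    SIZE_T WriteOffset;                    // advanced here, in the DPC
    SIZE_T ReadOffset;                     // advanced by the consumer
} RING_BUFFER, *PRING_BUFFER;

// Assumes two extra fields in DEVICE_CONTEXT:
//   PRING_BUFFER Ring;               // the large cyclic buffer (non-paged)
//   PVOID        DmaCommonBufferVa;  // KVA of the buffer the DMA core fills
// and that the WDFDPC was created with the device as its parent object.

VOID
EvtDpcDataReady(
    _In_ WDFDPC Dpc
    )
{
    PDEVICE_CONTEXT ctx  = GetDeviceContext(WdfDpcGetParentObject(Dpc));
    PRING_BUFFER    ring = ctx->Ring;

    // Copy the chunk the DMA core just finished into the cyclic buffer.
    RtlCopyMemory(ring->Data + ring->WriteOffset,
                  ctx->DmaCommonBufferVa,
                  CHUNK_SIZE);
    ring->WriteOffset = (ring->WriteOffset + CHUNK_SIZE) % RING_SIZE;

    // Re-arm the card so it can generate the next data-ready interrupt.
    WRITE_REGISTER_ULONG((PULONG)(ctx->RegisterBase + REG_IRQ_ACK), 1);
}
```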
The problem I’m facing is that once in a while (~once a day, under stress conditions) data in the memory buffer gets overwritten.
I’ve instrumented my code using the TraceLogging API to insert a trace event at each point of the process (MSI ISR start, MSI ISR end, DPC start, DPC end, memory buffer overrun, etc.) in order to understand whether I’m losing interrupts, losing DPCs, or hitting some other problem.
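A stripped-down sketch of the instrumentation (the provider name and GUID are placeholders; the per-interrupt sequence number is an optional extra that makes it easy to pair each ISR with its DPC when reading the trace):

```c
#include <ntddk.h>
#include <wdf.h>
#include <TraceLoggingProvider.h>

TRACELOGGING_DEFINE_PROVIDER(
    g_TraceProvider,
    "MyCompany.DmaStream",                              // placeholder name
    (0x11111111, 0x2222, 0x3333,
     0x44, 0x44, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55));  // placeholder GUID - generate your own

// DriverEntry:      TraceLoggingRegister(g_TraceProvider);
// EvtDriverUnload:  TraceLoggingUnregister(g_TraceProvider);

// At the top and bottom of the ISR (and analogously in the DPC):
//
//   ULONG seq = InterlockedIncrement(&ctx->IsrCount);   // volatile LONG in the context
//   TraceLoggingWrite(g_TraceProvider, "IsrStart",
//                     TraceLoggingUInt32(seq, "Seq"));
//   ...
//   TraceLoggingWrite(g_TraceProvider, "IsrEnd",
//                     TraceLoggingUInt32(seq, "Seq"));
```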
What I see in the traces is that occasionally the MSI ISR runs but the matching DPC isn’t called, and by the time the next MSI ISR runs and its matching DPC is called, the data in the buffer has already been overwritten.
My question is: what could possibly prevent a DPC from running?
How can I continue my debugging session from here?
One common cause of this is that you get two interrupts before your DPC is able to start up. Many driver writers assume one interrupt == one DPC, and that’s not necessarily true. Your DPC needs to be able to handle whatever work is outstanding, even if that means handling two interrupts.
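For illustration, a DPC written that way reads the hardware’s notion of how far it has gotten and drains everything outstanding, rather than assuming exactly one chunk per DPC (the index register, chunk count, and helper below are hypothetical, reusing the device context from the sketches above):

```c
#define REG_DMA_WRITE_INDEX 0x18   // hypothetical: index of the chunk the HW wrote last
#define NUM_CHUNKS          16     // hypothetical: chunks in the DMA buffer

// Assumes a ULONG SwIndex field in DEVICE_CONTEXT (next chunk to consume) and a
// helper that does the actual copy of one chunk into the ring buffer.
VOID CopyChunkToRing(_In_ PDEVICE_CONTEXT Ctx, _In_ ULONG ChunkIndex);

VOID
EvtDpcDataReady(
    _In_ WDFDPC Dpc
    )
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(WdfDpcGetParentObject(Dpc));

    // The hardware index may have advanced more than one step if a second
    // interrupt fired before this DPC started, so drain everything.
    ULONG hwIndex = READ_REGISTER_ULONG(
        (PULONG)(ctx->RegisterBase + REG_DMA_WRITE_INDEX));

    while (ctx->SwIndex != hwIndex) {
        CopyChunkToRing(ctx, ctx->SwIndex);
        ctx->SwIndex = (ctx->SwIndex + 1) % NUM_CHUNKS;
    }
}
```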
It’s not relevant to your overrun problem, but MSI interrupts let you design your device interface so that you should never have to read any device registers to understand which interrupt it is: one MSI interrupt for data, a different MSI interrupt for errors.
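Sketched out, that might look like the following in EvtDevicePrepareHardware, assuming the card advertises at least two MSI/MSI-X messages and the INF opts in via the usual MessageSignaledInterruptProperties keys (MSISupported, MessageNumberLimit); all callback names here are illustrative:

```c
EVT_WDF_INTERRUPT_ISR EvtIsrDataReady, EvtIsrError;
EVT_WDF_INTERRUPT_DPC EvtInterruptDpcDataReady, EvtInterruptDpcError;

NTSTATUS
EvtDevicePrepareHardware(
    _In_ WDFDEVICE    Device,
    _In_ WDFCMRESLIST ResourcesRaw,
    _In_ WDFCMRESLIST ResourcesTranslated
    )
{
    ULONG messageIndex = 0;

    for (ULONG i = 0; i < WdfCmResourceListGetCount(ResourcesTranslated); i++) {
        PCM_PARTIAL_RESOURCE_DESCRIPTOR trans =
            WdfCmResourceListGetDescriptor(ResourcesTranslated, i);
        PCM_PARTIAL_RESOURCE_DESCRIPTOR raw =
            WdfCmResourceListGetDescriptor(ResourcesRaw, i);

        if (trans->Type != CmResourceTypeInterrupt) {
            continue;                        // e.g. the memory BARs
        }

        WDF_INTERRUPT_CONFIG cfg;
        if (messageIndex == 0) {
            // message 0: nominal data-ready
            WDF_INTERRUPT_CONFIG_INIT(&cfg, EvtIsrDataReady, EvtInterruptDpcDataReady);
        } else {
            // message 1 (and up): errors
            WDF_INTERRUPT_CONFIG_INIT(&cfg, EvtIsrError, EvtInterruptDpcError);
        }
        cfg.InterruptRaw        = raw;       // required when creating interrupts
        cfg.InterruptTranslated = trans;     // in EvtDevicePrepareHardware

        WDFINTERRUPT interrupt;
        NTSTATUS status = WdfInterruptCreate(Device, &cfg,
                                             WDF_NO_OBJECT_ATTRIBUTES,
                                             &interrupt);
        if (!NT_SUCCESS(status)) {
            return status;
        }
        messageIndex++;
    }

    return STATUS_SUCCESS;
}
```

With that layout the data ISR never has to touch a device register just to find out why it was interrupted.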
My problem doesn’t seem to be the one pointed out by Tim_Roberts, since the DPC processes all the data in the buffer, regardless of whether it was produced by two consecutive interrupts or by only one.
At the end of the DPC we re-arm the interrupt, so interrupts arrive one after another with at least 50 ms between them.
I wish I could attach a screenshot of the Windows Performance Analyzer plot to show you the behavior I see:
Since all of my routines are wrapped with start and end trace events, I can tell that while most of the time an ISR is followed by its DPC almost immediately (within a few microseconds), once in a while a DPC starts ~60 ms after the ISR has ended, for no apparent reason (the previous DPC isn’t still running when the new ISR arrives, there’s no other ISR close by, etc.).
The delay I see is between the end of the ISR and the beginning of the DPC (hence, I guess the time is being spent somewhere in the OS, where I can’t access or debug anything).
Please let me know if any of my assumptions are wrong!
Questions:
Is there a way to debug the Windows dispatcher to see what was running between my ISR and the DPC?
Since this is a dedicated PC with a custom image of Windows 10, and it shouldn’t be running any processes like web browsers or other user applications, what could possibly cause Windows to prefer some other entity in the system over my DPC?
How can I force Windows to handle my DPC at the highest priority, immediately after the ISR, without any delay?
Could the DPC queue already be packed with other DPC jobs? Is there a way to have a DPC queue dedicated to my DPC alone, so it gets handled first?
I wish I could attach a screenshot of the Windows Performance Analyzer plot to show you the behavior I see
The forum supports that… just cut/paste it!
I can tell that while most of the time an ISR is followed by its DPC almost immediately (within a few microseconds), once in a while a DPC starts ~60 ms after the ISR has ended
Yup… This is classic Windows behavior. We’ve been discussing it, and what to do about it, for (quite literally) decades. Like I tell my students/clients, Windows’ average DPC latency is very, very good. But it’s the worst case latency that can kill you.
Though, I have to say… sixty MILLISECONDS is really, really, really huge. And I can’t think of ANYTHING that could account for this, and I really think you’ve got some sort of measurement error. I’ve seen ONE millisecond ISR-to-DPC latencies before, but that was years ago and on a slow processor (and it was due to a combination of a network driver interrupting, a video driver interrupting, and a ton of nasty page faults).
There are a very limited number of things that can increase your DPC latency so drastically:
interrupts that arrive between the time you queue your DPC and the time your DPC gets to do its useful work
other DPCs that are queued ahead of yours.
You should be able to see what’s going on in XPerf… which you are already using.
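If you want to experiment with the second item, the only knobs I know of are the classic WDM ones, reachable from KMDF through the underlying KDPC (a sketch, assuming the WDFDPC you enqueue is the DataDpc from your earlier description; do this once, after WdfDpcCreate):

```c
#include <ntddk.h>
#include <wdf.h>

VOID
TuneDataDpc(
    _In_ WDFDPC DataDpc
    )
{
    PKDPC kdpc = WdfDpcWdmGetDpc(DataDpc);

    // HighImportance places the DPC at the head of the per-processor DPC
    // queue instead of the tail.
    KeSetImportanceDpc(kdpc, HighImportance);

    // Optionally pin the DPC to one CPU so it doesn't compete with DPCs
    // queued by other devices on the interrupting processor.
    // KeSetTargetProcessorDpc(kdpc, 2);
}
```

Note that this only reorders your DPC relative to other queued DPCs; it does nothing about a DPC that is already running, and nothing about interrupts that land in between.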