Measuring the DPC and SINGLE_DPC_TIMEOUT_EXCEEDED

We have encountered an ASSERTION,
.
Assertion: *** DPC execution time exceeds system limit
This is NOT a break in update time
This is a BUG in a DPC routine
Perform a stack trace to find the culprit
The time out limit will be doubled on continuation
Use gh to continue!!
.
DPC_TIMEOUT_TYPE: SINGLE_DPC_TIMEOUT_EXCEEDED
DPC_RUNTIME: 501
DPC_TIME_LIMIT: 500
FAULTING_IP:
nt!KeAccumulateTicks+18756a
fffff805`7a8b862a cd2c int 2Ch
ERROR_CODE: (NTSTATUS) 0xc0000420 - An assertion failure has occurred.
EXCEPTION_CODE: (NTSTATUS) 0xc0000420 - An assertion failure has occurred.
EXCEPTION_CODE_STR: c0000420

ffffc601e2aa1bb0 fffff80112715a4f : 0001311fcccea0dd ffffc601e2a85180 0000000000000286 000000000055d1bd : nt!KeAccumulateTicks+0x18756a
ffffc601e2aa1c10 fffff8011260847c : 0000000000000000 ffff9a0313ab8b00 ffff838137e5a680 ffff9a0313ab8bb0 : nt!KeClockInterruptNotify+0xcf
ffffc601e2aa1f30 fffff80112780a15 : ffff9a0313ab8b00 0000000000000000 0000000000000001 ffffc601e2a85180 : hal!HalpTimerClockIpiRoutine+0x1c
ffffc601e2aa1f60 fffff8011284f92a : ffff838137e5a680 ffff9a0313ab8b00 ffff9a070ba97000 ffff9a0313ab8b00 : nt!KiCallInterruptServiceRoutine+0xa5
ffffc601e2aa1fb0 fffff8011284fe77 : 0000000004d11000 0000000000000000 0000000000000002 0000000000000000 : nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
ffff838137e5a600 fffff8011271edd1 : 0000000000000000 0000000000010008 0000000000000460 ffff9a070737a000 : nt!KiInterruptDispatchNoLockNoEtw+0x37
ffff838137e5a790 fffff801129b942d : ffff9a0313aa0500 fffff801129b7eef 0000000000000460 ffff9a070ba96ba0 : nt!KiAcquireKobjectLockSafe+0x31
ffff838137e5a7c0 fffff801128a9d57 : 0000000000000460 0000000000000000 0000000000000460 0000000000000000 : nt!ExpCheckForWorker+0xf9
ffff838137e5a820 fffff801129baf5c : ffff9a031340b000 0000000000000000 000000005843424e fffff8011272b023 : nt!ExpFreePoolChecks+0x170c1b
ffff838137e5a850 fffff801129eb9c2 : ffff9a031340b000 ffff9a03c76f0b70 0000000000000000 ffff800000000000 : nt!ExpFreeHeapSpecialPool+0x54
ffff838137e5a8a0 fffff80f123a8bf1 : ffff9a070ba96ba0 ffff9a070ba96ba0 000044da05c7ef00 ffff9a03192f3dc8 : nt!ExFreePoolWithTag+0x3992
ffff838137e5aa50 fffff80112f447d7 : ffff9a070ba96ba0 ffff9a03c76f0b70 ffff9a070ba96ba0 fffff80f127b7665 : VerifierExt!ExFreePoolWithTag_wrapper+0x11
ffff838137e5aa80 fffff80112f4474c : ffff9a031fdc8580 0000000000000000 0000000000000000 ffff9a03732e3930 : nt!VerifierExFreePoolWithTag+0x57
ffff838137e5aab0 fffff8011042bbfa : 0000000000000000 ffff9a070ba96ba0 ffff9a031fdc8580 ffff9a031bba3000 : nt!VerifierExFreePool+0x1c
.
Q1. It is not a Bug Check 0x133 (DPC_WATCHDOG_VIOLATION) but an ASSERTION. Why did the bug check not happen?
.
I started debugging and wanted to measure the time taken by each DPC (average, min, and max values).
So I added a KeQueryPerformanceCounter() call at the beginning and at the end of the DPC handler, took the difference, and divided it by the performance frequency to obtain the elapsed time.
The min value was around 20 us but the max value obtained was 2 seconds, and a few completions finish in milliseconds.
Later, I realized that this may not be the correct way to measure the DPC, because it can include the time during which the DPC is interrupted by the ISR.
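The instrumentation was roughly the following (a minimal sketch; the routine name and the counters are placeholders, not the actual driver code, and the counter updates are not synchronized across processors):

    #include <ntddk.h>

    //
    // Sketch of the measurement described above. The elapsed value is
    // wall-clock time on this processor, so it also includes any time the
    // DPC spent interrupted by ISRs.
    //
    static LONGLONG g_MinUs = MAXLONGLONG;
    static LONGLONG g_MaxUs = 0;
    static LONGLONG g_TotalUs = 0;
    static LONGLONG g_Samples = 0;        // average = g_TotalUs / g_Samples

    KDEFERRED_ROUTINE MyDpcRoutine;

    VOID
    MyDpcRoutine(
        _In_ PKDPC Dpc,
        _In_opt_ PVOID DeferredContext,
        _In_opt_ PVOID SystemArgument1,
        _In_opt_ PVOID SystemArgument2
        )
    {
        LARGE_INTEGER freq;
        LARGE_INTEGER start, end;
        LONGLONG elapsedUs;

        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(DeferredContext);
        UNREFERENCED_PARAMETER(SystemArgument1);
        UNREFERENCED_PARAMETER(SystemArgument2);

        start = KeQueryPerformanceCounter(&freq);

        //
        // ... the DPC's actual completion processing goes here ...
        //

        end = KeQueryPerformanceCounter(NULL);

        //
        // Convert performance-counter ticks to microseconds.
        //
        elapsedUs = ((end.QuadPart - start.QuadPart) * 1000000LL) / freq.QuadPart;

        if (elapsedUs < g_MinUs) { g_MinUs = elapsedUs; }
        if (elapsedUs > g_MaxUs) { g_MaxUs = elapsedUs; }
        g_TotalUs += elapsedUs;
        g_Samples += 1;
    }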
.
I found the following ways to measure the DPC:
a) https://docs.microsoft.com/en-us/windows-hardware/drivers/devtest/example-15--measuring-dpc-isr-time
b) xperf from Windows SDK(Windows Performance Toolkit)
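For (b), a capture along these lines should work; the exact kernel-flag and action names should be verified against the installed Windows Performance Toolkit version:

    xperf -on PROC_THREAD+LOADER+DPC+INTERRUPT -stackwalk DPC
    ... run the traffic that reproduces the long DPC ...
    xperf -d dpcisr.etl
    xperf -i dpcisr.etl -o dpcisr.txt -a dpcisr

The dpcisr action summarizes DPC and ISR time per module, which is a quick way to confirm which image owns the long-running DPC.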
.
From the ETL log the following is observed: 649.477935200 (Fragment Enter), 650.533282500 (Fragment Exit), 1,057.932500 (Exclusive Duration).
It looks like xperf/tracelog both use similar logic: take a timestamp (e.g., KeQueryPerformanceCounter()) at entry and at exit and measure the difference.
Also note that in this run there is no ASSERTION or Bug Check, which means the DPC/ISR did not actually take 1 second or exceed the limit.
It appears that the DPC got interrupted and got control back after 1 second.
In other runs, the max time went up to 2 and 3 seconds.
.
Q2. Is the understanding of the longer times correct, i.e., is it because the DPC is interrupted and then resumed after the elapsed time, and the tools therefore show more time?
.
The original issue is the long time taken by the DPC, which is under investigation; we are in the process of understanding where the driver is spending its time.
Q3. If the DPC is waiting for a spin lock (KeAcquireInStackQueuedSpinLockAtDpcLevel), will that waiting time be counted toward the watchdog ticks?
.

Q1. It is not a Bug Check 0x133 (DPC_WATCHDOG_VIOLATION) but an ASSERTION. Why did the bug check not happen?

Because the debugger is attached, would be my guess.

Q2. Is the understanding of the longer times correct, i.e., is it because the DPC is interrupted and then resumed after the elapsed time, and the tools therefore show more time?

I’m not sure I fully understand your scenario. But, yes… if your DPC is interrupted then (as far as I understand it) the time taken in that ISR will be counted against your DPC for the purposes of the Watchdog timer.

The min value was around 20 us but the max value obtained was 2 seconds,

Crazy, huh? Yeah… I bet if you computed it, you’d find the AVERAGE latency to be pretty short as well. But 2 Secs is unreasonably long… something bad is definitely happening.

What kind of device is this, by the way? A GPIO device?? Anything unusual about your device, your ISR, or your DPC that you can share???

Q3. If the DPC is waiting for a spin lock (KeAcquireInStackQueuedSpinLockAtDpcLevel), will that waiting time be counted toward the watchdog ticks?

Yes. Definitely.
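If you want to see the budget from inside your DPC, for example right after you acquire the lock, KeQueryDpcWatchdogInformation reports the per-DPC and cumulative watchdog counters. A minimal sketch, assuming it is called from the DPC at DISPATCH_LEVEL:

    #include <ntddk.h>

    //
    // Sketch: log how much of the DPC watchdog budget remains.
    // The counts are in clock ticks and count down toward zero.
    //
    VOID
    LogDpcWatchdogBudget(
        VOID
        )
    {
        KDPC_WATCHDOG_INFORMATION wdi;

        if (NT_SUCCESS(KeQueryDpcWatchdogInformation(&wdi))) {
            DbgPrintEx(DPFLTR_IHVDRIVER_ID, DPFLTR_INFO_LEVEL,
                       "Single DPC: %lu of %lu ticks left; cumulative: %lu of %lu\n",
                       wdi.DpcTimeCount, wdi.DpcTimeLimit,
                       wdi.DpcWatchdogCount, wdi.DpcWatchdogLimit);
        }
    }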

Peter

@“Peter_Viscarola_(OSR)” said:

Q1. It is not a Bug Check 0x133 (DPC_WATCHDOG_VIOLATION) but an ASSERTION. Why did the bug check not happen?

Because the debugger is attached, would be my guess.

Q2. Is the understanding of the longer times correct, i.e., is it because the DPC is interrupted and then resumed after the elapsed time, and the tools therefore show more time?

I’m not sure I fully understand your scenario. But, yes… if your DPC is interrupted then (as far as I understand it) the time taken in that ISR will be counted against your DPC for the purposes of the Watchdog timer.

This is a pretty old, regular NIC driver and hardware. The hardware interrupts, the ISR queues the corresponding DPC, and the DPC handles the completions. The DPC runs a loop and consumes messages. The maximum number of messages the DPC can process before yielding is set to 256, and from the debug logs it is observed that, most of the time when the 256-message limit is hit, the time taken is around 2 ms (max); for these unusually long DPCs the number of messages processed ranges from 4 to 24.
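The loop is roughly of this shape; this is only a sketch with made-up names (NIC_ADAPTER and the helper routines are placeholders, not the actual driver code):

    #include <ntddk.h>

    #define MAX_MESSAGES_PER_DPC 256          // yield after this many messages

    typedef struct _NIC_ADAPTER NIC_ADAPTER, *PNIC_ADAPTER;   // hypothetical context

    BOOLEAN NicHasPendingMessage(_In_ PNIC_ADAPTER Adapter);  // hypothetical helpers
    VOID    NicConsumeMessage(_In_ PNIC_ADAPTER Adapter);

    KDEFERRED_ROUTINE NicDpcRoutine;

    VOID
    NicDpcRoutine(
        _In_ PKDPC Dpc,
        _In_opt_ PVOID DeferredContext,
        _In_opt_ PVOID SystemArgument1,
        _In_opt_ PVOID SystemArgument2
        )
    {
        PNIC_ADAPTER adapter = (PNIC_ADAPTER)DeferredContext;
        ULONG processed = 0;

        while (processed < MAX_MESSAGES_PER_DPC && NicHasPendingMessage(adapter)) {
            NicConsumeMessage(adapter);       // one message per iteration
            processed++;
        }

        if (NicHasPendingMessage(adapter)) {
            //
            // Budget exhausted but work remains: re-queue the DPC rather
            // than looping on, so one DPC instance stays short.
            //
            KeInsertQueueDpc(Dpc, SystemArgument1, SystemArgument2);
        }
    }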
I have just started using xperf, and with my limited knowledge we could see that at the time of the issue only the DPC is running and there is no call to the ISR. I also checked the CPU timeline and no other driver is being executed, so it appears that all of the 2 seconds is used by the DPC itself; strangely, there is no ASSERT/Bug Check. But these observations might very well be wrong (I will need to learn how to study the xperf ETL in detail and efficiently).

The min value was around 20 us but the max value obtained was 2 seconds,

Crazy, huh? Yeah… I bet if you computed it, you’d find the AVERAGE latency to be pretty short as well. But 2 Secs is unreasonably long… something bad is definitely happening.

Very true: from the xperf ETL log, the driver's average is 245 us, the rogue DPC handler's average is 86 us, and among the other DPC handlers the maximum average is 1 ms.

What kind of device is this, by the way? A GPIO device?? Anything unusual about your device, your ISR, or your DPC that you can share???

This is a regular PCI NIC device, nothing unusual. Only the test case is newer, i.e., the customer is running his application with bidirectional traffic. When the assert hits, it is after 8-10 hours of running the traffic, and it is very inconsistent; the test has run multiple times for 48+ hours without any ASSERT.

Q3. If the DPC is waiting for a spin lock (KeAcquireInStackQueuedSpinLockAtDpcLevel), will that waiting time be counted toward the watchdog ticks?

Yes. Definitely.

Hmm, then the DPC is probably waiting a long time to acquire the spin lock.
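A quick way to confirm that would be to time just the acquisition inside the DPC; a sketch (RxLock, the routine name, and the 1 ms threshold are all made up):

    #include <ntddk.h>

    //
    // Sketch only: measure how long the DPC spins waiting for its queued
    // spin lock, and log anything above an arbitrary 1 ms.
    //
    VOID
    TimeLockAcquire(
        _Inout_ PKSPIN_LOCK RxLock                 // hypothetical per-adapter lock
        )
    {
        LARGE_INTEGER freq;
        LARGE_INTEGER before, after;
        LONGLONG waitUs;
        KLOCK_QUEUE_HANDLE lockHandle;

        before = KeQueryPerformanceCounter(&freq);

        KeAcquireInStackQueuedSpinLockAtDpcLevel(RxLock, &lockHandle);

        after  = KeQueryPerformanceCounter(NULL);
        waitUs = ((after.QuadPart - before.QuadPart) * 1000000LL) / freq.QuadPart;

        if (waitUs > 1000) {                       // arbitrary logging threshold
            DbgPrintEx(DPFLTR_IHVDRIVER_ID, DPFLTR_WARNING_LEVEL,
                       "DPC waited %I64d us for the spin lock\n", waitUs);
        }

        //
        // ... work protected by the lock ...
        //

        KeReleaseInStackQueuedSpinLockFromDpcLevel(&lockHandle);
    }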


When the assert hits, it is after 8-10 hours of running the traffic, and it is very inconsistent; the test has run multiple times for 48+ hours without any ASSERT.

What you’re seeing, in my experience, is CLASSIC Windows behavior. Average ISR to DPC latency, and average DPC runtime, can be low… but every once in a while… things can get out of hand. We’ve had clients coming to us complaining of strange problems like this that happen “once a day, overnight, sometimes” – It’s one of the least attractive things of kernel-mode programming on Windows.

All I can advise is that (1) if this is a closed system that you’re building, identify the offending component(s) and get them off the system, or (2) if your hardware and driver can be installed in a wide variety of systems, you’re just going to have to “suck it up” and provide mitigations… like queuing a work item. I know, that’s not super helpful. Sorry.
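For the work-item route, the usual shape is roughly the following; this is a sketch with hypothetical names (the context structure, routine names, and pool tag are made up), using the WDM IoAllocateWorkItem/IoQueueWorkItem pattern:

    #include <ntddk.h>

    //
    // Sketch: push the backlog out of the DPC and into a passive-level
    // work item, where no DPC watchdog budget applies.
    //
    #define BACKLOG_TAG 'gkcB'                     // arbitrary pool tag

    typedef struct _PENDING_WORK_CONTEXT {
        PIO_WORKITEM WorkItem;
        // ... whatever state the deferred processing needs ...
    } PENDING_WORK_CONTEXT, *PPENDING_WORK_CONTEXT;

    IO_WORKITEM_ROUTINE ProcessBacklogAtPassive;

    VOID
    ProcessBacklogAtPassive(
        _In_ PDEVICE_OBJECT DeviceObject,
        _In_opt_ PVOID Context
        )
    {
        PPENDING_WORK_CONTEXT work = (PPENDING_WORK_CONTEXT)Context;

        UNREFERENCED_PARAMETER(DeviceObject);

        //
        // PASSIVE_LEVEL: drain the backlog here without worrying about
        // the DPC time limit.
        //

        IoFreeWorkItem(work->WorkItem);
        ExFreePoolWithTag(work, BACKLOG_TAG);
    }

    //
    // Called from the DPC when the remaining work would blow the DPC budget.
    //
    VOID
    QueueBacklogWorkItem(
        _In_ PDEVICE_OBJECT DeviceObject
        )
    {
        PPENDING_WORK_CONTEXT work;

        work = (PPENDING_WORK_CONTEXT)ExAllocatePool2(POOL_FLAG_NON_PAGED,
                                                      sizeof(*work),
                                                      BACKLOG_TAG);
        if (work == NULL) {
            return;                                // fall back to the DPC path
        }

        work->WorkItem = IoAllocateWorkItem(DeviceObject);
        if (work->WorkItem == NULL) {
            ExFreePoolWithTag(work, BACKLOG_TAG);
            return;
        }

        IoQueueWorkItem(work->WorkItem, ProcessBacklogAtPassive,
                        DelayedWorkQueue, work);
    }

An NDIS miniport would more likely use the NdisAllocateIoWorkItem/NdisQueueIoWorkItem equivalents, but the idea is the same: keep each DPC instance short and bounded, and do the heavy lifting at PASSIVE_LEVEL.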

Peter

@“Peter_Viscarola_(OSR)” said:

When the assert hits, it is after 8-10 hours of running the traffic, and it is very inconsistent; the test has run multiple times for 48+ hours without any ASSERT.

What you’re seeing, in my experience, is CLASSIC Windows behavior. Average ISR to DPC latency, and average DPC runtime, can be low… but every once in a while… things can get out of hand. We’ve had clients coming to us complaining of strange problems like this that happen “once a day, overnight, sometimes” – It’s one of the least attractive things of kernel-mode programming on Windows.

All I can advise is that (1) if this is a closed system that you’re building, identify the offending component(s) and get them off the system, or (2) if your hardware and driver can be installed in a wide variety of systems, you’re just going to have to “suck it up” and provide mitigations… like queuing a work item. I know, that’s not super helpful. Sorry.

No, it is not a closed system; it is meant to work on a wide variety of systems. We will continue to work on this issue.
Thank you very much for your input on the questions asked earlier in the thread; it was very helpful for my understanding.
