Debugging NdisTimedDataHang reported by Driver Verifier

Mahesh · April 23, 2019, 8:35am

I enabled the NDIS/WIFI verification flag for my driver in Driver Verifier. This resulted in a BSOD for violating the NdisTimedDataHang rule. When I analyzed the dump, I got:

DRIVER_VERIFIER_DETECTED_VIOLATION (c4)
Arguments:
Arg1: 000000000009200f, ID of the 'NdisTimedDataHang' rule that was violated.
Arg2: fffff806cd819200, A pointer to the string describing the violated rule condition.
Arg3: ffff87862606b110, Address of internal rule state (second argument to !ruleinfo).
Arg4: ffff87862606b240, Address of supplemental states (third argument to !ruleinfo).

When I ran !ndiskd.pendingnbls, I got the list of NBLs that were pending when the dump was taken. To figure out which NBL caused the violation, I tried the !ruleinfo command with the arguments from the analysis output:

!ruleinfo 0x9200f 0xffff87862606b110 0xffff87862606b240

but WinDbg reported this error:

Failed to read the rule state (check the second argument).

There are pending NBLs that are currently held by drivers other than mine. I just want to make sure that the violation is not caused by my driver. Can someone please suggest what I am doing wrong? Is there any way to figure out which NBL failed to complete within 22 seconds, which is the requirement of the NdisTimedDataHang rule?

Jeffrey_Tippet_MSFT · April 23, 2019, 5:11pm

Sorry about !ruleinfo; I’ll look into why it doesn’t work. What OS version are you targeting, and which version of the debugger are you using? Check for an updated debugger; it might have a better !ruleinfo.

NdisTimedDataHang tends to be accurate for miniports. So in the meantime, I think it’s reasonable to run with the hypothesis that your miniport really has taken too long to process a Tx NBL. !pendingnbls has shown you some likely candidates. Typically, if you break in at a random moment when the computer is sitting mostly idle, you won’t see many NBLs in flight. (When an NBL is transmitted, most hardware can round-trip it back to TCPIP within a few milliseconds. So unless you’re saturating the network with back-to-back transmits, it’s rare to catch an NBL “in the act” just by breaking into the kernel debugger.) So if you see a bunch in !pendingnbls, those are likely to all be questionable.

“Lost” or “stuck” transmits are a very common problem for NDIS miniport drivers, so it’s worth using this as an opportunity to add some debugging aids to your driver. I suggest you consider:

Add counters at every component boundary. Count the number of NBLs that go into your driver, count the number of NBLs that go out of your driver. Count the number of NBLs that go into hardware, etc. If you have separate subsystems in your driver (e.g., a USB wrapper or a HAL/PAL), then add counters across that boundary. Counters will help you narrow down who’s lost the NBL.
Audit every line of code where NBLs are queued. For example, if you need to put NBLs onto a “pending” list while doing an 802.11 roam operation, carefully scrutinize that code to ensure that the NBLs don’t get forgotten in any cases. Make sure the lists are safe against races when applicable.
Consider writing a central “NBL queue” data structure whose job is to hold NBLs while you’re waiting for something, and which knows how to cancel NBLs if needed.
Stamp NBLs with a special signature whenever they pass across a component boundary. For example, you might write 0xffeeffee to NBL->MiniportReserved[1] when you get an NBL from NDIS, then write 0xeeffeeff when returning it. Then you can use WinDbg to search memory for NBLs that have your signature in them. Or, to be fancy: !ndiskd.nblpool -findnbl (@$extin).MiniportReserved[1]==0xffeeffee. (Note that !ndiskd.pendingnbls is basically doing this, but !pendingnbls can’t see into subsystems within your driver, so, e.g., it can’t distinguish between your NDIS edge and your HAL/PAL.)
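
To make the central "NBL queue" suggestion concrete, here is a minimal sketch in portable C, not real NDIS code. Everything in it is an assumption for illustration: `fake_nbl` stands in for NET_BUFFER_LIST, the `atomic_flag` spin loop stands in for whatever lock the driver really uses, and the `hq_*` names are hypothetical. The point is that every parked NBL lives in one place, and that a single drain call can detach all of them so the caller can complete or cancel the chain outside the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for NET_BUFFER_LIST; only the link field matters here. */
struct fake_nbl {
    struct fake_nbl *next;
    int id;
};

/* Hypothetical hold queue: the one place where waiting NBLs are parked. */
struct nbl_hold_queue {
    atomic_flag lock;        /* stand-in for the driver's real spin lock */
    struct fake_nbl *head;
    struct fake_nbl *tail;
    size_t depth;            /* number of NBLs currently parked */
};

static void hq_init(struct nbl_hold_queue *q)
{
    atomic_flag_clear(&q->lock);
    q->head = q->tail = NULL;
    q->depth = 0;
}

static void hq_lock(struct nbl_hold_queue *q)
{
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire)) {
    }
}

static void hq_unlock(struct nbl_hold_queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Park one NBL at the tail of the queue. */
static void hq_push(struct nbl_hold_queue *q, struct fake_nbl *nbl)
{
    assert(nbl != NULL);
    nbl->next = NULL;
    hq_lock(q);
    if (q->tail != NULL)
        q->tail->next = nbl;
    else
        q->head = nbl;
    q->tail = nbl;
    q->depth++;
    hq_unlock(q);
}

/* Detach every parked NBL in one step and return the chain in FIFO order.
 * The caller completes or cancels the chain after the lock is released,
 * so nothing can be left behind on the queue. */
static struct fake_nbl *hq_drain(struct nbl_hold_queue *q, size_t *count)
{
    hq_lock(q);
    struct fake_nbl *chain = q->head;
    *count = q->depth;
    q->head = q->tail = NULL;
    q->depth = 0;
    hq_unlock(q);
    return chain;
}
```

With one queue type like this, the "audit every line where NBLs are queued" step shrinks to auditing the callers of `hq_push` and making sure each wait condition has a matching `hq_drain` on both its success and failure paths.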

If you do one thing, I’d go for counters. They’re easy to add, and once you trust your own NBL counters, you won’t be scratching your head trying to figure out what !pendingnbls is telling you.
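
As a rough illustration of the counter idea, the sketch below uses portable C11 atomics instead of kernel interlocked operations. The four boundaries and all `nbl_*` names are hypothetical; a real driver would pick the boundaries that match its own subsystems (NDIS edge, USB wrapper, HAL/PAL, hardware) and bump the counters in its send and send-complete paths.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical boundaries; rename to match the driver's real subsystems. */
enum nbl_boundary {
    NBL_FROM_NDIS,   /* NBLs handed to the driver by NDIS */
    NBL_TO_HW,       /* NBLs submitted to hardware (or a USB/HAL layer) */
    NBL_FROM_HW,     /* NBLs the hardware reported as finished */
    NBL_TO_NDIS,     /* NBLs completed back to NDIS */
    NBL_BOUNDARY_COUNT
};

struct nbl_counters {
    atomic_uint_fast64_t count[NBL_BOUNDARY_COUNT];
};

static void nbl_counters_init(struct nbl_counters *c)
{
    for (int i = 0; i < NBL_BOUNDARY_COUNT; i++)
        atomic_init(&c->count[i], 0);
}

/* Record that n NBLs crossed boundary b. */
static void nbl_count(struct nbl_counters *c, enum nbl_boundary b, uint64_t n)
{
    assert(b < NBL_BOUNDARY_COUNT);
    atomic_fetch_add_explicit(&c->count[b], n, memory_order_relaxed);
}

/* NBLs that crossed `in` but have not yet crossed `out`.
 * The two loads are not one atomic snapshot, so the value can be briefly
 * off while traffic is flowing; in a crash dump or at idle it is exact. */
static int64_t nbl_gap(struct nbl_counters *c,
                       enum nbl_boundary in, enum nbl_boundary out)
{
    uint64_t entered = atomic_load_explicit(&c->count[in], memory_order_relaxed);
    uint64_t left = atomic_load_explicit(&c->count[out], memory_order_relaxed);
    return (int64_t)(entered - left);
}
```

For example, if `nbl_gap(&c, NBL_FROM_NDIS, NBL_TO_NDIS)` is 2 while `nbl_gap(&c, NBL_TO_HW, NBL_FROM_HW)` is 1, then one NBL is sitting in hardware and the other is parked somewhere in the driver's own software path, which narrows down where to look.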

Mahesh · April 23, 2019, 5:56pm

Thanks, Jeffrey, for the detailed information on how to debug the issue. I will add the necessary counters as suggested and see where the NBL leak is happening. My target is running Windows 10 RS4, and I am debugging from a host with WinDbg version 10.0.17763.1.

BTW, the issue happens only when the device is in Connected Standby mode. Are you aware of anything that NDIS virtual miniport drivers should consider when the device is in Connected Standby mode?

Jeffrey_Tippet_MSFT · April 23, 2019, 10:10pm

I am not aware of anything super special about Connected Standby. From a miniport’s perspective, Connected Standby is basically just a regular OID_PNP_SET_POWER, except maybe with a slightly different set of Wake-on-LAN patterns set.

But OID_PNP_SET_POWER is traumatic and difficult for everyone, so it’s a typical place to lose track of a few NBLs. Make sure that there’s not a race between the power transition and the datapath: if an NBL sneaks in at just the wrong moment, it shouldn’t get stuck for the duration of the power transition.
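
One common way to close that kind of race is a small "gate" in front of the datapath. The sketch below is portable C11 and only illustrates the pattern; the `gate_*` names are hypothetical and it is not taken from any NDIS sample. A send increments the in-flight count first and checks the paused flag second, while the power path sets the flag first and checks the count second. With sequentially consistent atomics, at least one side always notices the other, so an NBL cannot slip in unseen.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical gate guarding the send path during a power transition. */
struct datapath_gate {
    atomic_bool paused;     /* set while the power transition is running */
    atomic_int  in_flight;  /* sends currently inside the datapath */
};

static void gate_init(struct datapath_gate *g)
{
    atomic_init(&g->paused, false);
    atomic_init(&g->in_flight, 0);
}

/* Called at the top of the send path. Returns false if the NBL must be
 * completed back immediately instead of entering the datapath. */
static bool gate_enter(struct datapath_gate *g)
{
    atomic_fetch_add(&g->in_flight, 1);      /* announce ourselves first */
    if (atomic_load(&g->paused)) {           /* then look at the flag */
        atomic_fetch_sub(&g->in_flight, 1);
        return false;
    }
    return true;
}

/* Called when a send that passed gate_enter() has been completed. */
static void gate_exit(struct datapath_gate *g)
{
    int prev = atomic_fetch_sub(&g->in_flight, 1);
    assert(prev > 0);
    (void)prev;
}

/* Power path: close the gate, then poll or wait until drained. */
static void gate_begin_pause(struct datapath_gate *g)
{
    atomic_store(&g->paused, true);
}

static bool gate_is_drained(struct datapath_gate *g)
{
    return atomic_load(&g->in_flight) == 0;
}

static void gate_resume(struct datapath_gate *g)
{
    atomic_store(&g->paused, false);
}
```

In this sketch, an NBL that arrives after `gate_begin_pause` is rejected by `gate_enter`, and the assumption is that the driver then completes it straight back with a failure status instead of parking it until the device powers up again.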

Mahesh · May 3, 2019, 6:01pm

Thanks, Jeffrey. I was able to root-cause the issue and find where the driver leaked the NBLs. The additional logging and counters did help.