Is the device directly on the PCIe bus? Or is the device a child of a virtual bus driver?
Reviewing your crash stack trace again…
nt!KiPageFault
ndis!ndisSortNetBufferLists
ndis!ndisMDispatchReceiveNetBufferLists
ndis!ndisMDispatchReceiveNetBufferListsWithLock
ndis!ndisDoPeriodicReceivesIndication
ndis!ndisPeriodicReceivesWorker
This looks suspiciously like code that indicates deferred receive packets with a lock held, a lot like the description you quoted for slow mode.
A debugging suggestion: put a breakpoint on ndisMDispatchReceiveNetBufferListsWithLock and see what the flow looks like under normal conditions. This will take some digging through the assembler. The goal is to understand whether this path is indicating YOUR packets, and if so, whether you can find the list of NBLs at the time of the crash. If the chain of NBLs is damaged, and you can locate the head of the chain, asking !ndiskd.nbl to walk the chain may show you what’s wrong.
The prototype for the indicate function:
VOID NdisMIndicateReceiveNetBufferLists(
    _In_ NDIS_HANDLE       MiniportAdapterHandle,
    _In_ PNET_BUFFER_LIST  NetBufferLists,
    _In_ NDIS_PORT_NUMBER  PortNumber,
    _In_ ULONG             NumberOfNetBufferLists,
    _In_ ULONG             ReceiveFlags
);
Taking an intuitive stab at it: perhaps when you indicate the receives, the NumberOfNetBufferLists parameter is correct, but the last NBL has a non-NULL Next field, and this deferred processing ignores the element count and just runs until Next is NULL, so the last NBL’s leftover link points at something invalid. Or, since it’s a NULL reference at offset 0x0D, perhaps the number of NBLs does not match the number of entries in the list, and this deferred code processes the number of NBLs you told it to and faults when it walks past a NULL Next pointer.
I see in the crash dump there are a number of registers with kernel-address-like values. You might see if any of those is an NB or NBL by feeding them to !ndiskd.nbl.
I suppose there is also the question: is it possible you are indicating an empty list of NBLs under some condition? For example, you process receives in batches, but under some conditions you end up with an empty batch.
I assume you have turned on Driver Verifier with NDIS checking enabled, and turned the flags up pretty far? This tends to do a bunch of validation of data structures. Even better is the checked NDIS.sys, but that’s trickier, and on some OS versions, like Win 10, it seems to be unavailable. Since you are NDIS 6.2, you could run with the checked Server 2012 R2 build.
Running and passing the WHQL/HLK tests is quite useful for NIC drivers, even before you’re ready to pursue certification. Setting up the HLK would be less painful if MSFT just shipped a preconfigured VM.
Since you only see the issue after many enable/disable cycles, you might try changing the behavior of the system: run it with a single core enabled, or run it on a system with lots and lots of cores. This tends to alter synchronization behavior. I believe Win 10 Driver Verifier has synchronization fuzzing as an option too, and your driver should work on Win 10.
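For the single-core experiment, something like the following should do it (commands from memory, so double-check the exact Verifier flag set for your OS; "numproc" is the documented BCDEdit boot option):

```shell
rem Limit Windows to one logical processor (takes effect after reboot):
bcdedit /set numproc 1
rem To undo later:
rem   bcdedit /deletevalue numproc

rem Enable Driver Verifier standard checks on the miniport
rem (add NDIS/WIFI verification via the GUI or /flags on newer OSes):
verifier /standard /driver yourminiport.sys
```

Repeat your enable/disable stress loop in both the one-core and many-core configurations and see whether the time-to-crash changes dramatically; that’s usually a strong hint of a race.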
You might also turn on NDIS ETW/WPP tracing, and see if there are useful messages shortly before the crash.
Jan
On 9/18/16, 1:40 AM, “xxxxx@barco.com” wrote:
[MNA-180_Device.NT]
Characteristics = 0x1 ; NCF_VIRTUAL
should be:
[MNA-180_Device.NT]
Characteristics = 0x4 ; NCF_PHYSICAL
BusType = 0x05; PCIBus
Could this lead to the sort of crash we experienced?
Thanks,
- Bernard Willaert
—
NTDEV is sponsored by OSR