NDIS6.2 outstanding RCBs timeout ?

Bernard_Willaert-2 · August 24, 2016, 10:28am

We developed a NDIS miniport driver based on the sample in the Win7 DDK:
7600.16385.1\src\network\ndis\netvmini\6x to support our proprietary PCIe board that has 4 optical network interfaces.

We allocate a pool of 256 RCBs.
Sometimes we experience a hangup of the RX side, because it seems like all RCBs are in use = not returned by NDIS.
RXGetRCB fails to get an unused RCB from the pool.

Is there a possibility why NDIS does not return all RCBs via MPChar.ReturnNetBufferListsHandler ?
Is there a timeout on the use of an RCB, after which it can be considered free again ?

Thank you in advance for your time and effort,

Bernard Willaert
Barco, Healthcare Division
Belgium

Larry_Clawson · August 24, 2016, 5:48pm

Is NDIS holding these RCBs when you call NdisMIndicateReceiveNetBufferLists (), if so are you setting NDIS_RECEIVE_FLAGS_RESOURCES flag when you get low on RCBs?

Bernard_Willaert-2 · August 25, 2016, 10:23am

Thank you for your quick response, Larry!
We are not sure where/how RCBs are leaked. We assume that they are held by NDIS, but another option would be that they are never queued.
We are not setting the NDIS_RECEIVE_FLAGS_RESOURCES when running low on HW buffers.

I inherited the code from a colleague, which is more or less a copy of the DDK sample.
But - looking into it, a possible RCB leak could be in our version of MPReturnNetBufferLists:

original sample code snippet:

while (NetBufferLists)
{
PRCB Rcb = RCB_FROM_NBL(NetBufferLists);
ReturnRCB(Adapter, Rcb);
NetBufferLists = NET_BUFFER_LIST_NEXT_NBL(NetBufferLists);
}

our version:

while (NetBufferLists)
{
PRCB Rcb = RCB_FROM_NBL(NetBufferLists);
if (!Rcb)
{
LOG(LOG_WARNING, “Completion of a Rx packet from NDIS without hardware reference => exit completion”);
break;
}
RxReturnRCB(Adapter, Rcb);
NetBufferLists = NET_BUFFER_LIST_NEXT_NBL(NetBufferLists);
}

It doesn’t seem right to exit the loop when a NULL Rcb is detected, without returning possible others in the list ?

Is it possible that a NULL PRCB is returned in the NetBufferList ?
Or is it because we inserted a NULL pointer in the first place, when indicating it to NDIS ?

Thanks,

Bernard Willaert

Bernard_Willaert-2 · August 25, 2016, 10:50am

>> Or is it because we inserted a NULL pointer in the first place, when indicating it to NDIS ?

There is a check in RXReceiveIndicate, that will also exit the NBL loop, when a NULL RCB would be inserted:
//
// Collect pending NBLs, indicate up to MaxNblCountPerIndicate per receive block
//
for(NumNblsReceived=0; NumNblsReceived < AdapterDpc->MaxNblCountPerIndicate; ++NumNblsReceived)
{
PLIST_ENTRY Entry;
PRCB Rcb = NULL;

Entry = NdisInterlockedRemoveHeadList(&ReceiveBlock->ReceiveList, &ReceiveBlock->ReceiveListLock);
if(Entry)
{
Rcb = CONTAINING_RECORD(Entry, RCB, RcbLink);
}

if(!Rcb)
{
break;
}

So, the previous question remains: can NDIS return a NULL RCB ?

Larry_Clawson · August 25, 2016, 10:52am

Are you seeing the LOG_WARNING? That could be a leak, put a breakpoint on it and look at the NetBufferLists structure to see if there are more on the list.

Larry C

Bernard_Willaert-2 · September 15, 2016, 2:47am

I bring this issue back up because we did not succeed yet in solving it.

When we continuously enable/disable the network interface using a script with devcon, the test will fail about 1/500, always with an identical stack trace - pointing to ndisSortNetBufferList:

TRAP_FRAME: fffff880103954c0 – (.trap 0xfffff880103954c0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffffa802ad15c50 rbx=0000000000000000 rcx=fffffa802bb85340
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff880039533aa rsp=fffff88010395650 rbp=0000000000001000
r8=0000000000000000 r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
ndis!ndisSortNetBufferLists+0x1b6:
fffff880039533aa 8a420d mov al,byte ptr [rdx+0Dh] ds:000000000000000d=??
Resetting default scope

LAST_CONTROL_TRANSFER: from fffff80003ec89a9 to fffff80003ec9400

STACK_TEXT:
fffff88010395378 fffff80003ec89a9 : 000000000000000a 000000000000000d 0000000000000002 0000000000000000 : nt!KeBugCheckEx
fffff88010395380 fffff80003ec7620 : fffffa802c569010 fffff88010395510 fffffa802c569010 0000000000001000 : nt!KiBugCheckDispatch+0x69
fffff880103954c0 fffff880039533aa : fffffa802cc061a0 fffff8800396576c 0000000000000000 fffffa802a80a340 : nt!KiPageFault+0x260
fffff88010395650 fffff88003953931 : fffff88010395730 0000000000000000 fffffa802ad15b20 0000000000000000 : ndis!ndisSortNetBufferLists+0x1b6
fffff880103956f0 fffff880038ec655 : fffffa802cc0c1a0 fffffa802ad15b20 fffffa8000000000 0000000000000001 : ndis!ndisMDispatchReceiveNetBufferLists+0x121
fffff88010395a40 fffff880038ebc35 : fffffa802cc0c1a0 fffffa802ad15b20 0000000000000000 0000000000000001 : ndis!ndisMDispatchReceiveNetBufferListsWithLock+0xb1
fffff88010395ab0 fffff880038ebe55 : fffffa802cc0c1a0 fffffa802ad15b20 fffffa802cc0c1a0 fffffa802cc0d300 : ndis!ndisDoPeriodicReceivesIndication+0x28d
fffff88010395b30 fffff880038ec923 : fffffa802cc0c1a0 0000000000000002 0000000000000000 fffffa802cc0d388 : ndis!ndisPeriodicReceivesWorker+0x91
fffff88010395b70 fffff80004161bc6 : 0000000000000008 fffffa8020416b50 0000000000000080 0000000000000000 : ndis!ndisReceiveWorkerThread+0x2ab
fffff88010395c00 fffff80003ebb6a6 : fffff80004048e80 fffffa8020416b50 fffff80004056cc0 0000000000000000 : nt!PspSystemThreadStartup+0x5a
fffff88010395c40 0000000000000000 : fffff88010396000 fffff88010390000 fffff880103958c0 0000000000000000 : nt!KxStartSystemThread+0x16

Are we correct in assuming that we feed NDIS a netbufferlist that is corrupt?
We also assume that this happens during disable of the interface, because this is always what we see as last log outputs from the driver:
[MNA-180 NDIS] [FFFFFA801D220040] —> MPPause
[MNA-180 NDIS] [FFFFFA801D220040] —> MPRestart
[MNA-180 NDIS] [FFFFFA801D220040] —> MPPause
[MNA-180 NDIS] [FFFFFA801D220040] —> MPRestart
[MNA-180 NDIS] [FFFFFA801D220040] —> MPPause
[MNA-180 NDIS] [FFFFFA801D220040] —> MPHaltEx (enters here for reason 0)
[MNA-180 NDIS] [FFFFFA801D220040] —> MPShutdownEx
[MNA-180 NDIS] HwFreeReceiveEngine pHw->nBusyReceive=[0]

We do not indicate any new NBLs to NDIS from the moment this fails:
if(!MP_IS_READY(pAdapter)) with :

Any ideas on how to debug this further are greatly appreciated !!

Thank you in advance,

Bernard Willaert

Scott_Noone_OSR · September 15, 2016, 1:50pm

Have you tried this with Verifier enabled on your driver and on NDIS? It
might take much longer to happen (or never at all) if it’s timing related,
but you might get a better dump. A checked build of NDIS might also be
useful and/or interesting.

If you can put the dump someplace I can poke at it and try to provide any
leads. Not sure I’ll get anywhere, but always up for a challenge

-scott
OSR
@OSRDrivers

wrote in message news:xxxxx@ntdev…

I bring this issue back up because we did not succeed yet in solving it.

When we continuously enable/disable the network interface using a script
with devcon, the test will fail about 1/500, always with an identical stack
trace - pointing to ndisSortNetBufferList:

TRAP_FRAME: fffff880103954c0 – (.trap 0xfffff880103954c0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffffa802ad15c50 rbx=0000000000000000 rcx=fffffa802bb85340
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff880039533aa rsp=fffff88010395650 rbp=0000000000001000
r8=0000000000000000 r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
ndis!ndisSortNetBufferLists+0x1b6:
fffff880039533aa 8a420d mov al,byte ptr [rdx+0Dh] ds:000000000000000d=??
Resetting default scope

LAST_CONTROL_TRANSFER: from fffff80003ec89a9 to fffff80003ec9400

STACK_TEXT:
fffff88010395378 fffff80003ec89a9 : 000000000000000a 000000000000000d
0000000000000002 0000000000000000 : nt!KeBugCheckEx
fffff88010395380 fffff80003ec7620 : fffffa802c569010 fffff88010395510
fffffa802c569010 0000000000001000 : nt!KiBugCheckDispatch+0x69
fffff880103954c0 fffff880039533aa : fffffa802cc061a0 fffff8800396576c
0000000000000000 fffffa802a80a340 : nt!KiPageFault+0x260
fffff88010395650 fffff88003953931 : fffff88010395730 0000000000000000
fffffa802ad15b20 0000000000000000 : ndis!ndisSortNetBufferLists+0x1b6
fffff880103956f0 fffff880038ec655 : fffffa802cc0c1a0 fffffa802ad15b20
fffffa8000000000 0000000000000001 :
ndis!ndisMDispatchReceiveNetBufferLists+0x121
fffff88010395a40 fffff880038ebc35 : fffffa802cc0c1a0 fffffa802ad15b20
0000000000000000 0000000000000001 :
ndis!ndisMDispatchReceiveNetBufferListsWithLock+0xb1
fffff88010395ab0 fffff880038ebe55 : fffffa802cc0c1a0 fffffa802ad15b20
fffffa802cc0c1a0 fffffa802cc0d300 :
ndis!ndisDoPeriodicReceivesIndication+0x28d
fffff88010395b30 fffff880038ec923 : fffffa802cc0c1a0 0000000000000002
0000000000000000 fffffa802cc0d388 : ndis!ndisPeriodicReceivesWorker+0x91
fffff88010395b70 fffff80004161bc6 : 0000000000000008 fffffa8020416b50
0000000000000080 0000000000000000 : ndis!ndisReceiveWorkerThread+0x2ab
fffff88010395c00 fffff80003ebb6a6 : fffff80004048e80 fffffa8020416b50
fffff80004056cc0 0000000000000000 : nt!PspSystemThreadStartup+0x5a
fffff88010395c40 0000000000000000 : fffff88010396000 fffff88010390000
fffff880103958c0 0000000000000000 : nt!KxStartSystemThread+0x16

Are we correct in assuming that we feed NDIS a netbufferlist that is
corrupt?
We also assume that this happens during disable of the interface, because
this is always what we see as last log outputs from the driver:
[MNA-180 NDIS] [FFFFFA801D220040] —> MPPause
[MNA-180 NDIS] [FFFFFA801D220040] —> MPRestart
[MNA-180 NDIS] [FFFFFA801D220040] —> MPPause
[MNA-180 NDIS] [FFFFFA801D220040] —> MPRestart
[MNA-180 NDIS] [FFFFFA801D220040] —> MPPause
[MNA-180 NDIS] [FFFFFA801D220040] —> MPHaltEx (enters here for reason 0)
[MNA-180 NDIS] [FFFFFA801D220040] —> MPShutdownEx
[MNA-180 NDIS] HwFreeReceiveEngine pHw->nBusyReceive=[0]

We do not indicate any new NBLs to NDIS from the moment this fails:
if(!MP_IS_READY(pAdapter)) with :

Any ideas on how to debug this further are greatly appreciated !!

Thank you in advance,

Bernard Willaert

Jan_Bottorff · September 15, 2016, 8:57pm

I assume you are doing the MPPause asynchronously? And don’t complete the pause request until the upper layers have given back all the receive nbl’s? And you have given back any send requests?

My debug suggestion is to turn on nbl tracking (TrackNblOwner registry set to about 3). When it crashes on the null pointer dump the nbl’s with !ndiskd.pendingnbls and see if you have completed the pause while nbl’s are still outstanding. It would be bad to rip nb/nbl data structures out from under the upper layers. The crash seems like its in some worker thread processing.

Jan

From: xxxxx@lists.osr.com on behalf of xxxxx@barco.com
Sent: Wednesday, September 14, 2016 11:48:12 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] NDIS6.2 outstanding RCBs timeout ?

I bring this issue back up because we did not succeed yet in solving it.

When we continuously enable/disable the network interface using a script with devcon, the test will fail about 1/500, always with an identical stack trace - pointing to ndisSortNetBufferList:

TRAP_FRAME: fffff880103954c0 – (.trap 0xfffff880103954c0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffffa802ad15c50 rbx=0000000000000000 rcx=fffffa802bb85340
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff880039533aa rsp=fffff88010395650 rbp=0000000000001000
r8=0000000000000000 r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
ndis!ndisSortNetBufferLists+0x1b6:
fffff880039533aa 8a420d mov al,byte ptr [rdx+0Dh] ds:000000000000000d=??
Resetting default scope

LAST_CONTROL_TRANSFER: from fffff80003ec89a9 to fffff80003ec9400

STACK_TEXT:
fffff88010395378 fffff80003ec89a9 : 000000000000000a 000000000000000d 0000000000000002 0000000000000000 : nt!KeBugCheckEx
fffff88010395380 fffff80003ec7620 : fffffa802c569010 fffff88010395510 fffffa802c569010 0000000000001000 : nt!KiBugCheckDispatch+0x69
fffff880103954c0 fffff880039533aa : fffffa802cc061a0 fffff8800396576c 0000000000000000 fffffa802a80a340 : nt!KiPageFault+0x260
fffff88010395650 fffff88003953931 : fffff88010395730 0000000000000000 fffffa802ad15b20 0000000000000000 : ndis!ndisSortNetBufferLists+0x1b6
fffff880103956f0 fffff880038ec655 : fffffa802cc0c1a0 fffffa802ad15b20 fffffa8000000000 0000000000000001 : ndis!ndisMDispatchReceiveNetBufferLists+0x121
fffff88010395a40 fffff880038ebc35 : fffffa802cc0c1a0 fffffa802ad15b20 0000000000000000 0000000000000001 : ndis!ndisMDispatchReceiveNetBufferListsWithLock+0xb1
fffff88010395ab0 fffff880038ebe55 : fffffa802cc0c1a0 fffffa802ad15b20 fffffa802cc0c1a0 fffffa802cc0d300 : ndis!ndisDoPeriodicReceivesIndication+0x28d
fffff88010395b30 fffff880038ec923 : fffffa802cc0c1a0 0000000000000002 0000000000000000 fffffa802cc0d388 : ndis!ndisPeriodicReceivesWorker+0x91
fffff88010395b70 fffff80004161bc6 : 0000000000000008 fffffa8020416b50 0000000000000080 0000000000000000 : ndis!ndisReceiveWorkerThread+0x2ab
fffff88010395c00 fffff80003ebb6a6 : fffff80004048e80 fffffa8020416b50 fffff80004056cc0 0000000000000000 : nt!PspSystemThreadStartup+0x5a
fffff88010395c40 0000000000000000 : fffff88010396000 fffff88010390000 fffff880103958c0 0000000000000000 : nt!KxStartSystemThread+0x16

Are we correct in assuming that we feed NDIS a netbufferlist that is corrupt?
We also assume that this happens during disable of the interface, because this is always what we see as last log outputs from the driver:
[MNA-180 NDIS] [FFFFFA801D220040] —> MPPause
[MNA-180 NDIS] [FFFFFA801D220040] —> MPRestart
[MNA-180 NDIS] [FFFFFA801D220040] —> MPPause
[MNA-180 NDIS] [FFFFFA801D220040] —> MPRestart
[MNA-180 NDIS] [FFFFFA801D220040] —> MPPause
[MNA-180 NDIS] [FFFFFA801D220040] —> MPHaltEx (enters here for reason 0)
[MNA-180 NDIS] [FFFFFA801D220040] —> MPShutdownEx
[MNA-180 NDIS] HwFreeReceiveEngine pHw->nBusyReceive=[0]

We do not indicate any new NBLs to NDIS from the moment this fails:
if(!MP_IS_READY(pAdapter)) with :

#define MP_IS_READY(_M) (((_M)->Flags & (fMP_DISCONNECTED | fMP_RESET_IN_PROGRESS | fMP_ADAPTER_HALT_IN_PROGRESS | fMP_ADAPTER_PAUSE_IN_PROGRESS | fMP_ADAPTER_PAUSED | fMP_ADAPTER_IN_LOW_POWER )) == 0)

Any ideas on how to debug this further are greatly appreciated !!

Thank you in advance,
- Bernard Willaert

—
NTDEV is sponsored by OSR

Visit the list online at: http:

MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!
Details at http:

To unsubscribe, visit the List Server section of OSR Online at http:</http:></http:></http:>

Bernard_Willaert-2 · September 16, 2016, 3:19am

Scott & Jan,
Thanks a lot for your valuable inputs !
Greatly appreciated !

@Scott:
I sent a minidump via WeTransfer.
Thank you in advance for looking into it.
The good news is that the dumps are always consistent, no matter what we tried to modify in the MPhalt / MPShutdown sequence.

@Jan:
Yes, we are handling MPPause asynchronously – code below.
The NICIsBusy() function checks if there are any receive blocks in use. We only have 1 receive queue.
The number of receive blocks in use is incremented at 1 central point where NdisMIndicateReceiveNetBufferLists is called.
It is decremented in MPReturnNetBufferLists.

>>My debug suggestion is to turn on nbl tracking (TrackNblOwner registry set to about 3).

We will definitely try this.
A desparate attempt was already made adding NdisMSleep(100000); before freeing any memory resources, but even this didn’t get rid of the issue.
Our assumption was that 100 ms delay would give NDIS enough time to return any NBLs that it still had in use…

------------------MPPause:
NDIS_STATUS Status = NDIS_STATUS_SUCCESS;

PMP_ADAPTER Adapter = MP_ADAPTER_FROM_CONTEXT(MiniportAdapterContext);

LOG(LOG_TEST, “[%p] —> MPPause”, Adapter);
UNREFERENCED_PARAMETER(MiniportPauseParameters);
PAGED_CODE();

MP_SET_FLAG(Adapter, fMP_ADAPTER_PAUSE_IN_PROGRESS);

NICStopTheDatapath(Adapter);

if (NICIsBusy(Adapter))
{
//
// The adapter is busy sending or receiving data, so the pause must
// be completed asynchronously later.
//
LOG(LOG_INFO, “[%p] MPPause NIC is busy:[%d]”, Adapter, NICIsBusy(Adapter));
NICScheduleTheResetOrPauseDpc(Adapter, FALSE);
Status = NDIS_STATUS_PENDING;
}
else // !NICIsBusy
{
//
// The pause operation has completed synchronously.
//
LOG(LOG_INFO, “[%p] MPPause NIC is no more busy:[%d]”, Adapter, NICIsBusy(Adapter));
MP_SET_FLAG(Adapter, fMP_ADAPTER_PAUSED);
MP_CLEAR_FLAG(Adapter, fMP_ADAPTER_PAUSE_IN_PROGRESS);
Status = NDIS_STATUS_SUCCESS;
}

LOG(LOG_TRACE, “[%p] <— MPPause”, Adapter);
return Status;

Thanks again,

Bernard Willaert

Bernard_Willaert-2 · September 16, 2016, 4:35am

Thanks to Jan’s input we now realize that the minidump does not contain all the information needed to do !ndiskd.pendingnbls.
We enabled a full memory crash dump and are waiting for the next crash - test running…

An interesting article on a very similar issue:
https://blogs.msdn.microsoft.com/ntdebugging/2013/03/29/debugging-a-network-connectivity-issue-tracknblowner-to-the-rescue/

Thanks,

Bernard Willaert

Bernard_Willaert-2 · September 17, 2016, 5:30am

After running the interface disable/enable test for over 1200 times, the crash occurred again. This time I had a full memory dump and a short analysis in windbg showed that there are some pending NBLs:

4: kd> !ndiskd.pendingnbls

PHASE 1/3: Found 21 NBL pool(s).
PHASE 2/3: Found 387 freed NBL(s).

Pending Nbl Currently held by
fffffa802855de20 NDIS [NDIS]
fffffa8027335b20 fffffa80284eb1a0 - MNA-1x0 Network Adapter (NDIS 6.20) #5 [Miniport]
fffffa8028506a50 fffffa80284431a0 - MNA-1x0 Network Adapter (NDIS 6.20) #4 [Miniport]

PHASE 3/3: Found 3 pending NBL(s) of 1164 total NBL(s).
Search complete.

kd> !ndiskd.miniport fffffa80284eb1a0

MINIPORT

MNA-1x0 Network Adapter (NDIS 6.20) #5

Ndis handle fffffa80284eb1a0
Ndis API version v6.20
Adapter context fffffa8028379a80
Miniport driver fffffa8027b2f350 - MNA-180_NDIS v4.2
Network interface fffffa800e45a870

Media type 802.3
Device instance {B85B7C50-6A01-11d2-B841-00C04FAD5171}\MNA-180-NIC\5&eebdd3a&1c&00
Device object fffffa80284eb050 More information
MAC address 00-04-a5-12-14-00

AUTOMATIC DIAGNOSTICS

This miniport will operate in a slow path until 2 more protocol(s) bind, or
until a 90-second timeout expires.
What does this mean?

STATE

Miniport Running
Device PnP Started
Datapath Normal
Interface Up
Media Connected
Power D0
References 7
Total resets 0
Pending OID None
Flags BUS_MASTER, DEFAULT_PORT_ACTIVATED,
SUPPORTS_MEDIA_SENSE, DOES_NOT_DO_LOOPBACK,
MEDIA_CONNECTED
PnP flags VIRTUAL_DEVICE, RECEIVED_START, NDIS_WDM_DRIVER

BINDINGS

Protocol list Driver Open Context
TCPIP fffffa801cfcb3a0 fffffa802813e8d0 fffffa80281b1010
LLTDIO fffffa801f331890 fffffa80280f68d0 fffffa80285cd370
RSPNDR fffffa801f31c7e0 fffffa8028138230 fffffa80283d56b0

Filter list Driver Module Context
WFP LightWeight Filter-0000
fffffa801e0f4d60 fffffa80284da4b0 fffffa802828d870
QoS Packet Scheduler-0000
fffffa801e0f4900 fffffa8028466010 fffffa8027b40980

MORE INFORMATION

Driver handlers Task offloads
Power management PM protocol offloads
Pending OIDs Timers
Pending NBLs
Wake-on-LAN (WoL) Packet filter
Receive queues Receive filtering
RSS NIC switch
Hardware resources
NDIS ports WMI guids

It is not clear to me - apart from the obvious NBLs still in use, what exactly causes the crash here. Is it in any way related to shutting down the miniport ?

Thanks for your help !

Bernard Willaert

Jan_Bottorff · September 17, 2016, 7:36pm

It says the miniport state for instance fffffa80284eb1a0 is running, which seems a bit different than I remember the trace saying. It looks like you have two miniport instances, as there are NBLs currently held by two instances? What’s the state of the other instance fffffa80284431a0.

You might also look more at the NBL owned by NDIS, fffffa802855de20, and see who originated it. You might also look through the crash stack walks and see if any function parameters match up with the pending NBLs.

You should figure out the originator of those outstanding NBLs, you might need to look at their originating pool and find the pool owner.

I’ve never seen that message about the slow path, perhaps MSFT could give us some insight.

Jan

On 9/17/16, 2:31 AM, “xxxxx@lists.osr.com on behalf of xxxxx@barco.com” wrote:

After running the interface disable/enable test for over 1200 times, the crash occurred again. This time I had a full memory dump and a short analysis in windbg showed that there are some pending NBLs:

4: kd> !ndiskd.pendingnbls

PHASE 1/3: Found 21 NBL pool(s).
PHASE 2/3: Found 387 freed NBL(s).

Pending Nbl Currently held by
fffffa802855de20 NDIS [NDIS]
fffffa8027335b20 fffffa80284eb1a0 - MNA-1x0 Network Adapter (NDIS 6.20) #5 [Miniport]
fffffa8028506a50 fffffa80284431a0 - MNA-1x0 Network Adapter (NDIS 6.20) #4 [Miniport]

PHASE 3/3: Found 3 pending NBL(s) of 1164 total NBL(s).
Search complete.

------------------

kd> !ndiskd.miniport fffffa80284eb1a0

MINIPORT

MNA-1x0 Network Adapter (NDIS 6.20) #5

Ndis handle fffffa80284eb1a0
Ndis API version v6.20
Adapter context fffffa8028379a80
Miniport driver fffffa8027b2f350 - MNA-180_NDIS v4.2
Network interface fffffa800e45a870

Media type 802.3
Device instance {B85B7C50-6A01-11d2-B841-00C04FAD5171}\MNA-180-NIC\5&eebdd3a&1c&00
Device object fffffa80284eb050 More information
MAC address 00-04-a5-12-14-00

AUTOMATIC DIAGNOSTICS

This miniport will operate in a slow path until 2 more protocol(s) bind, or
until a 90-second timeout expires.
What does this mean?

STATE

Miniport Running
Device PnP Started
Datapath Normal
Interface Up
Media Connected
Power D0
References 7
Total resets 0
Pending OID None
Flags BUS_MASTER, DEFAULT_PORT_ACTIVATED,
SUPPORTS_MEDIA_SENSE, DOES_NOT_DO_LOOPBACK,
MEDIA_CONNECTED
PnP flags VIRTUAL_DEVICE, RECEIVED_START, NDIS_WDM_DRIVER

BINDINGS

Protocol list Driver Open Context
TCPIP fffffa801cfcb3a0 fffffa802813e8d0 fffffa80281b1010
LLTDIO fffffa801f331890 fffffa80280f68d0 fffffa80285cd370
RSPNDR fffffa801f31c7e0 fffffa8028138230 fffffa80283d56b0

Filter list Driver Module Context
WFP LightWeight Filter-0000
fffffa801e0f4d60 fffffa80284da4b0 fffffa802828d870
QoS Packet Scheduler-0000
fffffa801e0f4900 fffffa8028466010 fffffa8027b40980

MORE INFORMATION

Driver handlers Task offloads
Power management PM protocol offloads
Pending OIDs Timers
Pending NBLs
Wake-on-LAN (WoL) Packet filter
Receive queues Receive filtering
RSS NIC switch
Hardware resources
NDIS ports WMI guids

------------------------------------------------

It is not clear to me - apart from the obvious NBLs still in use, what exactly causes the crash here. Is it in any way related to shutting down the miniport ?

Thanks for your help !
- Bernard Willaert

—
NTDEV is sponsored by OSR

Visit the list online at: http:

MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!
Details at http:

To unsubscribe, visit the List Server section of OSR Online at http:</http:></http:></http:>

Bernard_Willaert-2 · September 18, 2016, 4:23am

Thanks for your input, Jan !

Regarding the initmode:
This miniport will operate in a slow path until 2 more protocol(s) bind, or
until a 90-second timeout expires.
>>> What does this mean? >> This is actually a url, which results in this explanation:

!ndiskd.help initmode
For performance reasons, NDIS tries to avoid repeatedly pausing/restarting
the miniport for bind, since that will result in many OperStatus down/up
notifications. (Usermode tends to do some work for each notification,
including possibly sending DHCP packets). Therefore, when a miniport is
first initialized, NDIS will bind all protocols and filters at once. This
reduces the number of times the miniport must be paused/restarted.

However, at boot, drivers may load in any order. In the worst-case, a
miniport might load first, followed by each protocol, and then by each
filter, one-at-a-time. In that situation, NDIS would still have to
pause/restart the miniport many times, since protocols and filters only
show up one-by-one.

To avoid slowing down boot with many pause/restarts, NDIS will operate the
miniport in a special mode during boot: Init Mode. A miniport that is in
Init Mode will not pause during a protocol bind or filter attach. However,
this means that the datapath must be synchronized with the PNP path. So
during Init Mode, NDIS acquires a spinlock for each packet (NBL). This
slows down the datapath slightly, but significantly speeds up system boot.

A miniport should only be in Init Mode until all protocols and filters have
attempted to bind. However, if a protocol or filter is supposed to bind to
the miniport, but the driver doesn’t register with NDIS, then NDIS will
wait (keeping the miniport in Init Mode) until the protocol or filter
finally shows up. This means that it is detrimental to install a protocol
or filter over a miniport, but then to not actually start the driver and
register with NDIS.

If you see a miniport stuck in Init Mode for the full timeout (30 - 90
seconds, depending on the OS version), you will see reduced performance on
the datapath during the timeout period. This does not have an effect on
correctness or safety – it is merely a warning that your datapath
performance will be temporarily sub-optimal during boot.

There are 2 miniport instances - for 2 different network interfaces.

One thing I also find odd, when looking into the miniport characteristics is in the miniport STATE, the PnP flags:

PnP flags VIRTUAL_DEVICE, RECEIVED_START, NDIS_WDM_DRIVER

And when looking into the inf:

[MNA-180_Device.NT]
Characteristics = 0x1 ; NCF_VIRTUAL

This was taken from the original DDK sample.
In our case this should be 0x4 = NCF_PHYSICAL, as we are a real adapter.

Could this possibly make any difference in the crash we experience ?

Bernard Willaert

Bernard_Willaert-2 · September 18, 2016, 4:40am

[MNA-180_Device.NT]
Characteristics = 0x1 ; NCF_VIRTUAL

should be:

[MNA-180_Device.NT]
Characteristics = 0x4 ; NCF_PHYSICAL
BusType = 0x05; PCIBus

Could this lead to the sort of crash we experienced ?

Thanks,

Bernard Willaert

Jan_Bottorff · September 18, 2016, 6:41am

Is the device directly on the PCIe bus? Or is the device a child of a virtual bus driver?

Reviewing your crash stack trace again…

nt!KiPageFault
ndis!ndisSortNetBufferLists
ndis!ndisMDispatchReceiveNetBufferLists
ndis!ndisMDispatchReceiveNetBufferListsWithLock
ndis!ndisDoPeriodicReceivesIndication
ndis!ndisPeriodicReceivesWorker

This looks suspiciously like code that indicates deferred receive packets with a lock held, a lot like the description you quoted for slow mode.

A debugging suggestion would be to put a breakpoint on ndisMDispatchReceiveNetBufferListsWIthLock, and see what the flow looks like under normal conditions. This would take digging through the assembler. The goal is to understand if this is indicating YOUR packets, and if so can you understand how to find the list of NBLs when it crashes. If the chain of NBls is damaged, if you can find the head of the chain and ask !ndiskd.nbl to walk the chain, you can perhaps see what’s wrong.

The prototype for the indicate function:

VOID NdisMIndicateReceiveNetBufferLists(
In NDIS_HANDLE MiniportAdapterHandle,
In PNET_BUFFER_LIST NetBufferLists,
In NDIS_PORT_NUMBER PortNumber,
In ULONG NumberOfNetBufferLists,
In ULONG ReceiveFlags
);

Taking a random intuitive stab at it, perhaps when you indicate the receives, the number of NBLs parameter is correct, but the last NBL has a non-null next field, and this deferred processing ignores the element count and just runs until next is null, and the last NBL has a leftover link to something invalid. Or since it’s a null reference at offset 0x0D, perhaps the number of NBLs does not match the number of entries in the list, and this deferred code tries to process the number of NBLs you told it and faults when it tries to access a null next NBL.

I see in the crashdump there are a number of register with kernel address like values. You might see if any of those is an NB or NBL by feeding them to !ndiskd.nbl.

I suppose there is the question, is it possible you are indicating an empty list of NBLs under some condition? Like you process receives in batches, but under some conditions you end up with an empty batch.

I assume you have turned on driver verifier with NDIS checking enabled, and turned the flags up pretty far? This tends to a do a bunch of validation of data structures. Even better is the checked NDIS.sys, but that’s trickier, and on some OS versions, like Win 10, seems unavailable. Since you are NDIS 6.2, you could run with the checked Server 2012 R2 build.

Running and passing the WHQL/HLK tests is quite useful for NIC drivers, even before you’re ready to get certification. It would be less painful to setup the HLK if MSFT just shipped a preconfigured VM.

Since you only see the issue after many enable/disable cycles, you might try changing the behavior of the system, like run it with a single core enabled, or run it on a system with lots of and lots of cores. This tends to alter synchronization behavior. I believe win 10 driver verifier has synchronization fuzzing as an option too, and your driver should work on Win 10.

You might also turn on NDIS ETW/WPP tracing, and see if there are useful messages shortly before the crash.

Jan

On 9/18/16, 1:40 AM, “xxxxx@barco.com” wrote:

[MNA-180_Device.NT]
Characteristics = 0x1 ; NCF_VIRTUAL

should be:

[MNA-180_Device.NT]
Characteristics = 0x4 ; NCF_PHYSICAL
BusType = 0x05; PCIBus

Could this lead to the sort of crash we experienced ?

Thanks,
- Bernard Willaert

—
NTDEV is sponsored by OSR

Visit the list online at: http:

MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!
Details at http:

To unsubscribe, visit the List Server section of OSR Online at http:</http:></http:></http:>

Bernard_Willaert-2 · September 18, 2016, 1:58pm

>>>Is the device directly on the PCIe bus? Or is the device a child of a virtual
bus driver?

There is a WDF bus driver that allocates all the HW resources on the board: MSI interrrupts, BARs, DMA,… It creates 2 or 4 child devices = network interfaces with their associated miniport driver and hands them their respective portion of the HW resources.

In that way, it can remain a virtual device, since the network device is a child device of the owner of the bus resources.

I first want to find out if the crash occurs while enabling the device or disabling it.

It looks like we passed an NBL to NDIS with a specified length but with an invalid FirstNBL pointer.

Thanks again for your input, Jan !

Bernard Willaert

Bernard_Willaert-2 · September 19, 2016, 6:59am

Some more input on this crash:

adding extra logging in the test script that disables/enables the interfaces with devcon, shows that the crash occurs after enable – and not disable as we assumed so far ( where we would free NBLs that NDIS is still holding)
the crash occurs just after calling NdisMIndicateReceiveNetBufferLists()
NBLs in use:
kd> !ndiskd.pendingnbls

PHASE 1/3: Found 21 NBL pool(s).
PHASE 2/3: Found 409 freed NBL(s).

Pending Nbl Currently held by
fffffa802ae47520 NDIS [NDIS]
fffffa802fa24d50 fffffa802fa7e1a0 - MNA-1x0 Network Adapter (NDIS 6.20) #4 [Miniport]

PHASE 3/3: Found 2 pending NBL(s) of 1182 total NBL(s).
Search complete.

The NBL that NDIS is holding is invalid:

kd> !ndiskd.nbl fffffa802ae47520
NBL fffffa802ae47520 Next NBL NULL
First NB fffffa802ae47650 Source NULL
Flags INDICATED, NBL_ALLOCATED

Walk the NBL chain Dump data payload
Show out-of-band information

4: kd> !ndiskd.nbl fffffa802ae47520 -info

NET_BUFFER_LIST_INFO

TcpIpChecksumNetBufferListInfo 0
TcpOffloadBytesTransferred 0
IPsecOffloadV1NetBufferListInfo 0
IPsecOffloadV2NetBufferListInfo 0
TcpLargeSendNetBufferListInfo 0
TcpReceiveNoPush 0
ClassificationHandleNetBufferListInfo NULL
Ieee8021QNetBufferListInfo 0
NetBufferListCancelId NULL
MediaSpecificInformation 0
NetBufferListFrameType NDIS_PROTOCOL_ID_DEFAULT
NetBufferListHashValue 0
NetBufferListHashInfo 0
WfpNetBufferListInfo NULL
IPsecOffloadV2TunnelNetBufferListInfo 0
IPsecOffloadV2HeaderNetBufferListInfo 0
NetBufferListCorrelationId NULL
NetBufferListFilteringInfo 0
MediaSpecificInformationEx NULL
NblOriginalInterfaceIfIndex 0
NblReAuthWfpFlowContext NULL
TcpReceiveBytesTransferred 0
NblCurrentOwner ffffffffffffffff ----->>> ???

4: kd> !ndiskd.nbl fffffa802ae47520 -data
NET_BUFFER fffffa802ae47650
MDL fffffa8029995490
Warning: MDL fffffa8029995490 does not appear to be mapped into kernel
address space
Warning: the packet extends 338 byte(s) beyond the last MDL

Is there a way to check if the MDL is a valid mapped address before we indicate the NBL?

Thanks,

Bernard Willaert

Jan_Bottorff · September 19, 2016, 2:22pm

That sounds like a corrupted MDL chain. You could manually walk the MDL chain for each NB and map it with MmGetSystemAddressForMdlSafe/NdisQueryMdl before indicating it. If the chain is corrupt that should trigger a fault while it’s still in your code. I suppose you would need to walk the MDL chain with an awareness of the NB specified packet length to see if the below message is correct.

Jan

On 9/19/16, 4:00 AM, “xxxxx@lists.osr.com on behalf of xxxxx@barco.com” wrote:

4: kd> !ndiskd.nbl fffffa802ae47520 -data
NET_BUFFER fffffa802ae47650
MDL fffffa8029995490
Warning: MDL fffffa8029995490 does not appear to be mapped into kernel
address space
Warning: the packet extends 338 byte(s) beyond the last MDL

Is there a way to check if the MDL is a valid mapped address before we indicate the NBL?

Bernard_Willaert-2 · September 22, 2016, 4:14am

Thank you all for your help so far !

We strongly suspect that our bookkeeping of NBLs that is incremented with NdisMIndicateReceiveNetBufferLists and decremented with MPReturnNetBufferLists is not in sync during an occasional race condition.

Therefore, wouldn’t it be much simpler if we could just ask before shutdown how many NBLs NDIS is still holding ? Like the windbg command !ndiskd.pendingnbls ?

Incrementing and decrementing a counter in a non-atomic way in 2 different places seems like a recipe for failure ?

Bernard_Willaert-2 · September 28, 2016, 3:35am

RESOLVED -
It was observed both in the field and during functional tests that the network interface driver crashed while disabling and re-enabling it again.
This would happen randomly. Sometimes it was possible to disable/enable the interface 10 times without problem and sometimes it would crash the very first time.
The crash could be reproduced by running a script that continuously disabled and re-enabled our optical interfaces using devcon ( command line version of the device manager UI).
When the test was running overnight, on an idle system, the crash would typically occur after several hours and over 1000 disable/enable cycles.
The initial minidump stack traces were consistent: NDIS.sys ? a Windows core component ? crashed while processing an NBL (netbuffer list) containing network packets
received by our hardware and indicated to the NDIS driver from our miniport network adapter driver.
A typical minidump stack trace would look like: ( sequential from bottom to top) :
? nt!KeBugCheckEx
? nt!KiBugCheckDispatch+0x69
? nt!KiPageFault+0x260
? ndis!ndisSortNetBufferLists+0x1b6
? ndis!ndisMDispatchReceiveNetBufferLists+0x121
? ndis!ndisMDispatchReceiveNetBufferListsWithLock+0xb1
? ndis!ndisMTopReceiveNetBufferLists+0x3a
? ndis!ndisMIndicateReceiveNetBufferListsInternal+0x8f
? ndis!NdisMIndicateReceiveNetBufferLists+0x6f
? MNA_180_NDIS!RXReceiveIndicate+0x37e [d:\personal\projects\compositor\software\drivers\network\windows\datapath.c @ 1114]
? MNA_180_NDIS!RXReceiveIndicateDpc+0x2a [d:\personal\projects\compositor\software\drivers\network\windows\datapath.c @ 865]
? nt!KiRetireDpcList+0x1bc
? nt!KiIdleLoop+0x5a
We see that the last action from our miniport driver is the indication / transaction to NDIS of received netbuffer lists: NdisMIndicateReceiveNetBufferLists.
Our miniport driver flags also indicate to NDIS not to take a local copy of the netbuffer lists but to operate on them while they reside in memory allocated by our minidriver (zero copy).
We can see that NDIS crashes when it starts processing the contents of this netbuffer lists: ndisSortNetBufferLists. This results in a fatal memory page fault.
Our first reaction was that we freed the memory containing the netbuffer list while shutting down the miniport driver and releasing all resources,
while NDIS was still busy with one of the netbuffer lists we had indicated just before shutdown ( by disabling the device).

When asked on the OSR forum, the experts suggested to unleash some ndis specific commands on a full memory dump ( of all memory allocated at that time ).
The Windbg command that is really interesting here is !ndiskd.pendingnbls. This shows a list of all NetBufferLists that are in use at the time of the crash.
On a full memory dump this resulted typically in:
kd> !ndiskd.pendingnbls

PHASE 1/3: Found 21 NBL pool(s).
PHASE 2/3: Found 363 freed NBL(s).

Pending Nbl Currently held by
fffffa8020f1dbd0 NDIS [NDIS]
fffffa80269bebd0 fffffa8026e0c1a0 - MNA-1x0 Network Adapter (NDIS 6.20) #5 [Miniport]
fffffa8026b815e0 fffffa8026e0c1a0 - MNA-1x0 Network Adapter (NDIS 6.20) #5 [Miniport]

A constant in all memory dumps is the fact that there is 1 NBL held by NDIS.
When further analyzing this NBL, it always reports that the data described in the NBL points to memory that is not mapped at the time of the crash.

And further looking into the NDIS held NBL:

kd> !ndiskd.nbl fffffa8020f1dbd0
NBL fffffa8020f1dbd0 Next NBL NULL
First NB fffffa8020f1dd00 Source NULL
Flags INDICATED, NBL_ALLOCATED

And when checking the data payload:
kd> !ndiskd.nbl fffffa8020f1dbd0 -data
NET_BUFFER fffffa8020f1dd00
MDL fffffa802687b8a0
Warning: MDL fffffa802687b8a0 does not appear to be mapped into kernel address space
Warning: the packet extends 338 byte(s) beyond the last MDL

With 338 bytes being the length indicated in this MDL

We spent quite some time suspecting the bookkeeping of the number of NBLs indicated to NDIS, that was 1 off when freeing the memory holding the NBLs.
At shutdown, the amount of NBLs in flight should be zero and the shutdown/halt callback can only be completed safely when all NBLs are returned to us.
We added some extra sanity checks - including dummy wait cycles - to make sure that all NBLs wer returned to the minidriver.
We also made sure that there was a single point of access to indicate the NBLs to NDIS, that was disabled when the adapter was in one of the non-active states (shutting down, paused, …).

This did not solve the problem, so we turned our attention to the (re-)initialization of the device.

During the adapter initialization callback we allocate 2 (RX and TX) regions of coniguous memory needed for our DMA engine using MmAllocateContiguousMemory.
Received packets are DMA’d into this region = 256 PAGES = 256 packets in our case.

However, there was no check on the return value and the memory region was not zeroed out ( as described in the documentation on MmAllocateContiguousMemory).
So, we always succeeded the miniport initialization callback, even when this allocation failed ( on a heavily scattered system ).

Next, the MDL describing this memory was passed to NDIS with the first packet received and when accessing the data content, the system crashed.

Adding a check on success solved the issue:
when the allocation fails, the entire miniport initialization callback fails with a resource failure
and the adapter shows up in the device manager with an exclamation mark next to it.
A reboot is the only solution at that point.
According to the documentation, we should also call NdisWriteErrorLogEntry when the initialization fails.

I still have a question though:
When MmAllocateContiguousMemory returns a NULL virtual address, is it normal that NdisAllocateMdl does not fail when passing this address as VirtualAddress parameter ?

Thanks again for your help and inputs ! Without the extra NDIS specific windbg commands it would have taken an even longer time to nail this one. Static code review of course shows the error immediately, but you need to know where to look.

Best regards,

Bernard Willaert

NDIS6.2 outstanding RCBs timeout ?

LOG(LOG_TRACE, “[%p] <— MPPause”, Adapter); return Status;

LOG(LOG_TRACE, “[%p] <— MPPause”, Adapter);
return Status;