NDIS6 - mutual recursion between miniport OidQueryHandler and NdisMOidRequestComplete leading to sta

Hello Everyone

I have a bug check (alas, at customer) in the network stack, which is easily
identified as a stack overflow, but the stack is something the like of which
I’ve never encountered before in his space. I was wondering if any of the
NDIS gurus here have seen the like and would be kind enough to show some
signposts.

To set the stage a little, the adapter is an HP NC 375i (quad port, as it
happens) which is NetXen/QLogin chipset. The windows version is Server 2008
R2. The NDIS 6 adapter miniport driver is hpnd6x64, above which we have
(mine) NDIS 5 IM Filter NfPktFltr (child of passthru). We also have a couple
of NDIS 6 LWF in the picture - wfplwf and pacer.

So what happened here?

Well, (mine) NDIS 5 IM filter NfPktFltr QueryInformationHandler() has been
called, and has returned, after calling NDisRequest() as usual to pass the
request on down to the lower miniport. The request seems to have been given
to a system worker thread, and the errant stack is that of the system worker
thread. The business starts with ndisDoOidRequests(), which calls
ndisMDoOidRequest(), which in turn calls NDIS 6 MP hpnd6x64
OidRequestHandler(). Everything as expected up to here.

Now NDIS 6 MP hpnd6x64 OidRequestHandler() calls NdisMOidRequestComplete(),
which calls ndisMOidRequestCompleteInternal(). Everything as expected up to
here. The surprise is that ndisMOidRequestCompleteInternal() then calls back
into ndisMDoOidRequest(), which in turn calls back into NDIS 6 MP hpnd6x64
OidRequestHandler(). This mutual recursion happens more than 50 times, and
thread stack space is getting short now, eventually
ndisMOidRequestCompleteInternal() calls ndisOidRequestComplete(), which
calls ndisCompleteOidRequestToRequest(), which in turn calls into NDIS 5 IM
NfPktFltr RequestCompleteHandler(). So NDIS 5 IM NfPktFltr
RequestCompleteHandler() calls NdisMQueryInformationComplete() and a couple
of frames later ndisOidRequestComplete() calls KeSetEvent(). Everything
normal at the end here, except of course the stack overflow.

The part here of which I haven’t encountered the like in this space, well of
course it’s the mutual recursion around the loop NDIS 6 MP
OidRequestHandler() – NdisMOidRequestComplete() –
ndisMOidRequestCompleteInternal() – ndisMDoOidRequest() - NDIS 6 MP
OidRequestHandler(). If anyone has seen this before, can offer any ideas,
I’d be extremely grateful for signposts.

I’ve put the annotated stack trace below - compressed so as not to show the
same information 50+ times over :slight_smile:

Thanks,
Lyndon

8: kd> kv nf

Memory Child-SP RetAddr : Args to Child

: Call Site
00 fffff8800211bfe8 fffff8000166fca9 : 000000000000007f 0000000000000008 0000000080050033 00000000000006f8 : nt!KeBugCheckEx
01 8 fffff8800211bff0 fffff8000166e172 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 :
nt!KiBugCheckDispatch+0x69
02 140 fffff8800211c130 fffff8000167486c : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 :
nt!KiDoubleFaultAbort+0xb2 (TrapFrame @ fffff880`0211c130)

** Stack overflow

03 5fbed0 fffff88002718000 fffff88000e02fef : 0000000000000000 fffff88000000000 fffff88000e63700 fffffa8007f22168 : nt!KeSetEvent+0x1dc
04 70 fffff88002718070 fffff88000e4d91c : fffff88002115101 0000000000000000 fffff8800c9339e0 0000000000000000 :
NDIS!ndisOidRequestComplete+0x44f
05 90 fffff88002718100 fffff88000e4db9e : fffffa8008a669a0 0000000000000000 fffffa800e2e11a0 fffffa8000000230 :
NDIS!ndisMOidRequestCompleteInternal+0xdc
06 80 fffff88002718180 fffff88000e9a629 : fffffa800e2e11a0 fffffa801235f010 fffff88000e63110 0000000000000100 :
NDIS!ndisCompleteLegacyRequest+0x10e
07 60 fffff880027181e0 fffff88004588d40 : 0000000000000000 0000000000000000 0000000000000000 fffffa801235f010 :
NDIS!NdisMQueryInformationComplete+0xa9

*** NfPktFltr!RequestCompleteHandler called
NdisMQueryInformationComplete therefore this was not our own
IM-generated NDisRequest, we must have been in our miniport edge
QueryInformationHandler.

08 40 fffff88002718220 fffff88000e8a98b : fffffa8008cdca00 fffff800016b9fdc 000000006572444e fffffa8000000230 :
NFPktfltr!RequestCompleteHandler+0x18c

** NfPktFltr!RequestCompleteHandler called therefore NfPktFltr called
NDisRequest

09 30 fffff88002718250 fffff88000e02f33 : fffffa800e24b1a0 0000000000000000 fffffa801235f000 fffffa800e24b1a0 :
NDIS!ndisCompleteOidRequestToRequest+0x5b
0a 30 fffff88002718280 fffff88000e4d91c : fffffa8008cdca00 0000000000000000 fffffa8008cdca30 0000000000000000 :
NDIS!ndisOidRequestComplete+0x393
0b 90 fffff88002718310 fffff88000e9a6de : fffffa8008cdca30 fffff88000e63110 0000000000000000 fffffa8008cdca30 :
NDIS!ndisMOidRequestCompleteInternal+0xdc
0c 80 fffff88002718390 fffff88003c65899 : fffffa800e33e000 fffffa8008cdca78 fffffa8008cdca30 fffffa8000000000 :
NDIS!NdisMOidRequestComplete+0x6e
0d 40 fffff880027183d0 fffff88000e8792c : fffffa800e24b1a0 fffffa8008cdca78 0000000000000000 fffffa800e24b1a0 : hpnd6x64+0x29899

** The address hpnd6x64+0x29899 is offset 0x1e1 from the start of the
hpnd6x64 OidRequestHandler address displayed by ndiskd. The
previous instruction was a call through pointer to
NDIS!NdisMOidRequestComplete as expected. So this looks like it’s
part of the OidRequestHandler of hpnd6x64.

0e 30 fffff88002718400 fffff88000e4d9a6 : 0000000000000000 0000000000000000 0000000000000000 fffffa800e24b1a0 :
NDIS!ndisMDoOidRequest+0x26c
0f b0 fffff880027184b0 fffff88000e9a6de : fffffa8008cdca30 fffff88000e63110 0000000000000000 fffffa8008cdca30 :
NDIS!ndisMOidRequestCompleteInternal+0x166
10 80 fffff88002718530 fffff88003c65899 : fffffa800e33e000 fffffa8008cdca78 fffffa8008cdca30 fffffa8000000000 :
NDIS!NdisMOidRequestComplete+0x6e
11 40 fffff88002718570 fffff88000e8792c : fffffa800e24b1a0 fffffa8008cdca78 0000000000000000 fffffa800e24b1a0 : hpnd6x64+0x29899

** The above four frame sequence repeats 52 more times (of course the stack
pointer progresses in direction of overflow).

e2 30 fffff8800271da20 fffff88000e4d9a6 : 0000000000000000 0000000000000000 0000000000000000 fffffa800e24b1a0 :
NDIS!ndisMDoOidRequest+0x26c
e3 b0 fffff8800271dad0 fffff88000e9a6de : fffffa8008cdca30 fffff88000e63110 0000000000000000 fffffa8008cdca30 :
NDIS!ndisMOidRequestCompleteInternal+0x166
e4 80 fffff8800271db50 fffff88003c65899 : fffffa800e33e000 fffffa8008cdca78 fffffa8008cdca30 fffffa8000000000 :
NDIS!NdisMOidRequestComplete+0x6e
e5 40 fffff8800271db90 fffff88000e8792c : fffffa800e24b1a0 fffffa8008cdca78 0000000000000000 fffffa800e24b1a0 : hpnd6x64+0x29899

*** Going by name we expect ndisMDoOidRequest to call the
OidRequestHandler of an NDIS 6 miniport (or the
QueryInformationHandler of an NDIS 5 miniport, and the
OidRequestHandler of an NDIS 6 LWF). Since hpnd6x64 is an NDIS 6
miniport we expect the OidRequestHandler to be called here.

e6 30 fffff8800271dbc0 fffff88000e0745d : 0000000000000000 fffff800018155a0 fffff88000e63110 fffffa800715dcb0 :
NDIS!ndisMDoOidRequest+0x26c
e7 b0 fffff8800271dc70 fffff8000167d961 : fffff88000e073d0 fffffa8006898680 fffffa8007cf98c0 0000000000000002 :
NDIS!ndisDoOidRequests+0x8d

** Here’s a worker thread running some NDis internal … something to do
with Oid Request-s we suppose?

e8 40 fffff8800271dcb0 fffff80001914c06 : 0000000000000000 fffffa8006898680 0000000000000080 fffffa800681a450 :
nt!ExpWorkerThread+0x111
e9 90 fffff8800271dd40 fffff8000164ec26 : fffff880020a4180 fffffa8006898680 fffff880020af2c0 0000000000000000 :
nt!PspSystemThreadStartup+0x5a
ea 40 fffff8800271dd80 0000000000000000 : fffff8800271e000 fffff88002718000 fffff880027180e0 0000000000000000 :
nt!KxStartSystemThread+0x16

I suspect that you have handled this correctly, but could you show the code
where you call NdisRequest to pass down the query? In particular, how do you
handle the case where status != PENDING.

Thomas F. Divine


From: “Lyndon J. Clarke”
Sent: Thursday, November 25, 2010 9:39 AM
Newsgroups: ntdev
To: “Windows System Software Devs Interest List”
Subject: [ntdev] NDIS6 - mutual recursion between miniport OidQueryHandler
and NdisMOidRequestComplete leading to stack overflow.

> Hello Everyone
>
> I have a bug check (alas, at customer) in the network stack, which is
> easily identified as a stack overflow, but the stack is something the like
> of which I’ve never encountered before in his space. I was wondering if
> any of the NDIS gurus here have seen the like and would be kind enough to
> show some signposts.
>
> To set the stage a little, the adapter is an HP NC 375i (quad port, as it
> happens) which is NetXen/QLogin chipset. The windows version is Server
> 2008 R2. The NDIS 6 adapter miniport driver is hpnd6x64, above which we
> have (mine) NDIS 5 IM Filter NfPktFltr (child of passthru). We also have a
> couple of NDIS 6 LWF in the picture - wfplwf and pacer.
>
> So what happened here?
>
> Well, (mine) NDIS 5 IM filter NfPktFltr QueryInformationHandler() has been
> called, and has returned, after calling NDisRequest() as usual to pass the
> request on down to the lower miniport. The request seems to have been
> given to a system worker thread, and the errant stack is that of the
> system worker thread. The business starts with ndisDoOidRequests(), which
> calls ndisMDoOidRequest(), which in turn calls NDIS 6 MP hpnd6x64
> OidRequestHandler(). Everything as expected up to here.
>
> Now NDIS 6 MP hpnd6x64 OidRequestHandler() calls
> NdisMOidRequestComplete(), which calls ndisMOidRequestCompleteInternal().
> Everything as expected up to here. The surprise is that
> ndisMOidRequestCompleteInternal() then calls back into
> ndisMDoOidRequest(), which in turn calls back into NDIS 6 MP hpnd6x64
> OidRequestHandler(). This mutual recursion happens more than 50 times, and
> thread stack space is getting short now, eventually
> ndisMOidRequestCompleteInternal() calls ndisOidRequestComplete(), which
> calls ndisCompleteOidRequestToRequest(), which in turn calls into NDIS 5
> IM NfPktFltr RequestCompleteHandler(). So NDIS 5 IM NfPktFltr
> RequestCompleteHandler() calls NdisMQueryInformationComplete() and a
> couple of frames later ndisOidRequestComplete() calls KeSetEvent().
> Everything normal at the end here, except of course the stack overflow.
>
> The part here of which I haven’t encountered the like in this space, well
> of course it’s the mutual recursion around the loop NDIS 6 MP
> OidRequestHandler() – NdisMOidRequestComplete() –
> ndisMOidRequestCompleteInternal() – ndisMDoOidRequest() - NDIS 6 MP
> OidRequestHandler(). If anyone has seen this before, can offer any ideas,
> I’d be extremely grateful for signposts.
>
> I’ve put the annotated stack trace below - compressed so as not to show
> the same information 50+ times over :slight_smile:
>
> Thanks,
> Lyndon
>
>
> 8: kd> kv nf
> # Memory Child-SP RetAddr : Args to Child : Call
> Site
> 00 fffff8800211bfe8 fffff8000166fca9 : 000000000000007f <br>&gt; 0000000000000008 0000000080050033 00000000000006f8 : nt!KeBugCheckEx
> 01 8 fffff8800211bff0 fffff8000166e172 : 0000000000000000 <br>&gt; 0000000000000000 0000000000000000 0000000000000000 :
> nt!KiBugCheckDispatch+0x69
> 02 140 fffff8800211c130 fffff8000167486c : 0000000000000000 <br>&gt; 0000000000000000 0000000000000000 0000000000000000 :
> nt!KiDoubleFaultAbort+0xb2 (TrapFrame @ fffff8800211c130)<br>&gt;<br>&gt; **Stack overflow<br>&gt;<br>&gt; 03 5fbed0 fffff88002718000 fffff88000e02fef : 0000000000000000
> fffff88000000000 fffff88000e63700 fffffa8007f22168 : <br>&gt; nt!KeSetEvent+0x1dc<br>&gt; 04 70 fffff88002718070 fffff88000e4d91c : fffff88002115101
> 0000000000000000 fffff8800c9339e0 0000000000000000 : <br>&gt; NDIS!ndisOidRequestComplete+0x44f<br>&gt; 05 90 fffff88002718100 fffff88000e4db9e : fffffa8008a669a0
> 0000000000000000 fffffa800e2e11a0 fffffa8000000230 : <br>&gt; NDIS!ndisMOidRequestCompleteInternal+0xdc<br>&gt; 06 80 fffff88002718180 fffff88000e9a629 : fffffa800e2e11a0
> fffffa801235f010 fffff88000e63110 0000000000000100 : <br>&gt; NDIS!ndisCompleteLegacyRequest+0x10e<br>&gt; 07 60 fffff880027181e0 fffff88004588d40 : 0000000000000000
> 0000000000000000 0000000000000000 fffffa801235f010 : <br>&gt; NDIS!NdisMQueryInformationComplete+0xa9<br>&gt;<br>&gt;*** NfPktFltr!RequestCompleteHandler called<br>&gt; NdisMQueryInformationComplete therefore this was not our own<br>&gt; IM-generated NDisRequest, we must have been in our miniport edge<br>&gt; QueryInformationHandler.<br>&gt;<br>&gt; 08 40 fffff88002718220 fffff88000e8a98b : fffffa8008cdca00
> fffff800016b9fdc 000000006572444e fffffa8000000230 : <br>&gt; NFPktfltr!RequestCompleteHandler+0x18c<br>&gt;<br>&gt; **NfPktFltr!RequestCompleteHandler called therefore NfPktFltr called<br>&gt; NDisRequest<br>&gt;<br>&gt; 09 30 fffff88002718250 fffff88000e02f33 : fffffa800e24b1a0
> 0000000000000000 fffffa801235f000 fffffa800e24b1a0 : <br>&gt; NDIS!ndisCompleteOidRequestToRequest+0x5b<br>&gt; 0a 30 fffff88002718280 fffff88000e4d91c : fffffa8008cdca00
> 0000000000000000 fffffa8008cdca30 0000000000000000 : <br>&gt; NDIS!ndisOidRequestComplete+0x393<br>&gt; 0b 90 fffff88002718310 fffff88000e9a6de : fffffa8008cdca30
> fffff88000e63110 0000000000000000 fffffa8008cdca30 : <br>&gt; NDIS!ndisMOidRequestCompleteInternal+0xdc<br>&gt; 0c 80 fffff88002718390 fffff88003c65899 : fffffa800e33e000
> fffffa8008cdca78 fffffa8008cdca30 fffffa8000000000 : <br>&gt; NDIS!NdisMOidRequestComplete+0x6e<br>&gt; 0d 40 fffff880027183d0 fffff88000e8792c : fffffa800e24b1a0
> fffffa8008cdca78 0000000000000000 fffffa800e24b1a0 : hpnd6x64+0x29899<br>&gt;<br>&gt;** The address hpnd6x64+0x29899 is offset 0x1e1 from the start of the<br>&gt; hpnd6x64 OidRequestHandler address displayed by ndiskd. The<br>&gt; previous instruction was a call through pointer to<br>&gt; NDIS!NdisMOidRequestComplete as expected. So this looks like it's<br>&gt; part of the OidRequestHandler of hpnd6x64.<br>&gt;<br>&gt; 0e 30 fffff88002718400 fffff88000e4d9a6 : 0000000000000000
> 0000000000000000 0000000000000000 fffffa800e24b1a0 : <br>&gt; NDIS!ndisMDoOidRequest+0x26c<br>&gt; 0f b0 fffff880027184b0 fffff88000e9a6de : fffffa8008cdca30
> fffff88000e63110 0000000000000000 fffffa8008cdca30 : <br>&gt; NDIS!ndisMOidRequestCompleteInternal+0x166<br>&gt; 10 80 fffff88002718530 fffff88003c65899 : fffffa800e33e000
> fffffa8008cdca78 fffffa8008cdca30 fffffa8000000000 : <br>&gt; NDIS!NdisMOidRequestComplete+0x6e<br>&gt; 11 40 fffff88002718570 fffff88000e8792c : fffffa800e24b1a0
> fffffa8008cdca78 0000000000000000 fffffa800e24b1a0 : hpnd6x64+0x29899<br>&gt;<br>&gt; **The above four frame sequence repeats 52 more times (of course the <br>&gt; stack pointer progresses in direction of overflow).<br>&gt;<br>&gt; e2 30 fffff8800271da20 fffff88000e4d9a6 : 0000000000000000
> 0000000000000000 0000000000000000 fffffa800e24b1a0 : <br>&gt; NDIS!ndisMDoOidRequest+0x26c<br>&gt; e3 b0 fffff8800271dad0 fffff88000e9a6de : fffffa8008cdca30
> fffff88000e63110 0000000000000000 fffffa8008cdca30 : <br>&gt; NDIS!ndisMOidRequestCompleteInternal+0x166<br>&gt; e4 80 fffff8800271db50 fffff88003c65899 : fffffa800e33e000
> fffffa8008cdca78 fffffa8008cdca30 fffffa8000000000 : <br>&gt; NDIS!NdisMOidRequestComplete+0x6e<br>&gt; e5 40 fffff8800271db90 fffff88000e8792c : fffffa800e24b1a0
> fffffa8008cdca78 0000000000000000 fffffa800e24b1a0 : hpnd6x64+0x29899<br>&gt;<br>&gt;*** Going by name we expect ndisMDoOidRequest to call the<br>&gt; OidRequestHandler of an NDIS 6 miniport (or the<br>&gt; QueryInformationHandler of an NDIS 5 miniport, and the<br>&gt; OidRequestHandler of an NDIS 6 LWF). Since hpnd6x64 is an NDIS 6<br>&gt; miniport we expect the OidRequestHandler to be called here.<br>&gt;<br>&gt; e6 30 fffff8800271dbc0 fffff88000e0745d : 0000000000000000
> fffff800018155a0 fffff88000e63110 fffffa800715dcb0 : <br>&gt; NDIS!ndisMDoOidRequest+0x26c<br>&gt; e7 b0 fffff8800271dc70 fffff8000167d961 : fffff88000e073d0
> fffffa8006898680 fffffa8007cf98c0 0000000000000002 : <br>&gt; NDIS!ndisDoOidRequests+0x8d<br>&gt;<br>&gt; ** Here's a worker thread running some NDis internal ... something to do <br>&gt; with Oid Request-s we suppose?<br>&gt;<br>&gt; e8 40 fffff8800271dcb0 fffff80001914c06 : 0000000000000000
> fffffa8006898680 0000000000000080 fffffa800681a450 : <br>&gt; nt!ExpWorkerThread+0x111<br>&gt; e9 90 fffff8800271dd40 fffff8000164ec26 : fffff880020a4180
> fffffa8006898680 fffff880020af2c0 0000000000000000 : <br>&gt; nt!PspSystemThreadStartup+0x5a<br>&gt; ea 40 fffff8800271dd80 0000000000000000 : fffff8800271e000
> fffff88002718000 fffff880027180e0 00000000`00000000 :
> nt!KxStartSystemThread+0x16
>
>
> —
> NTDEV is sponsored by OSR
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer

Lyndon,

Are all of the 50 or so OID requests that are queued up against the NIC miniport from your IM driver or from some other source?

The upper edge of you IM (the miniport) being NDIS5 should never have more than one request ‘active’ at a time (NDIS will synchronize delivery of requests to the NDIS5 miniport).

Is something hammering requests at the NIC at such a rate that it has exposed this condition (which surely seems to be a flaw in how NDIS is behaving but I cannot confirm that).

And what happens when your IM driver is out of the picture? All works fine?

Cheers,
Dave Cattley

Hello David

Many thanks for your reply.

Are all of the 50 or so OID requests that are queued up against the NIC
miniport from your IM driver or from some other source?

I am struggling a bit with the words “queued up” here on account of the
apparent mutual recursion.

Seems to my mind, if there were a lot of OID requests queued for the
miniport, as an explanation for the 50+ calls to the miniport
OidRequestHandler() … then that would have to mean that there is a place
in ndisMOidRequestCompleteInternal() which takes next request from queue and
calls the miniport OidRequestHandler()? This would have to be a place which
is reachable from the NdisMOidRequestComplete() called from same miniport
OidRequestHandler().

The upper edge of you IM (the miniport) being NDIS5 should never have more
than one request ‘active’ at a time (NDIS will synchronize delivery of
requests to the NDIS5 miniport).

That was my understanding.

Is something hammering requests at the NIC at such a rate that it has
exposed this condition (which surely seems to be a flaw in how NDIS is
behaving but I cannot confirm that).

I am afraid that at this point I am only working from a crash dump, and some
of these questions are not answerable for me.

And what happens when your IM driver is out of the picture? All works
fine?

Sure, if the IM driver is out of the picture, the bug check is not arising
… whether or not the curious apparent mutual recursion is happening, I do
not know.

Best Wishes,
Lyndon

Apologies for jumping in, but can we clarify few points here?

  1. Does this reproduce, or occured only once in unknown circumstances?

  2. If it reproduces: would be useful to know what are these OIDs and what the miniport and IM returned.

IIRC, making a new request from completion path of a previous request is normal,
and yes, this can cause recursion and stack overflow.
But typically stack overflows don’t occur; something unusual triggered the long
chain of requests (spurious link status changes? hardware failure?)

Regards,
– pa

Lyndon,

Are you working with a minidump or a full (or kernel) crash dump? It may be possible to analyze the contents of the Request structures and at least extract the OID and possibly other details.

I think knowing the OID (or OIDs) involved might be a big clue as to why the situation is occurring.

The ndiskd extension can be a great help in analyzing a crash dump. In particular you may gain useful insight by having ndiskd dump state information about the miniport.

Good Luck,
Dave Cattley

The NDIS team has reproduced this issue on Windows 7 (aka Windows Server 2008 R2) when the machine was under *artificially heavy stress*. In that case, it is indeed a theoretical design bug in NDIS, since we’re pumping the OID queue in the context of the completion routine. But I’ve not heard of this repro’ing in a sane workload (e.g., at a customer site). So you might want to inspect your code to make sure that you’re not making the problem worse. In particular, check for OID requests that might take “too long” to complete in your filter, and check that you don’t have any loop that could issue 50 OID requests in very quick succession.

You may find !ndiskd.oid helpful – it lists all queued OID requests at each layer of the stack. What has perhaps gone wrong is that one OID request took an unreasonably long time to complete, and meanwhile hundreds of “quick” requests piled up behind it (e.g., OID_GEN_STATISTICS). The quick ones all complete synchronously, leading to the stack overflow.

Incidentally, the NDIS fix is not scheduled to go out with SP1, since we had never seen it affect real customers.

So a reentrancy check in the old IM 5 driver logic might be helpful? The pAdapt->MPOutstandingRequests field exists, but is never checked. If it was “fancied up” to use interlocked functions it could at least detect this situation and perhaps blow off re-use of the re-entered request.

Thomas F. Divine
http://www.pcausa.com

From: Jeffrey Tippet
Sent: Monday, November 29, 2010 3:37 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] NDIS6 - mutual recursion between miniport OidQueryHandler and NdisMOidRequestComplete leading to stack overflow.

The NDIS team has reproduced this issue on Windows 7 (aka Windows Server 2008 R2) when the machine was under *artificially heavy stress*. In that case, it is indeed a theoretical design bug in NDIS, since we’re pumping the OID queue in the context of the completion routine. But I’ve not heard of this repro’ing in a sane workload (e.g., at a customer site). So you might want to inspect your code to make sure that you’re not making the problem worse. In particular, check for OID requests that might take “too long” to complete in your filter, and check that you don’t have any loop that could issue 50 OID requests in very quick succession.

You may find !ndiskd.oid helpful – it lists all queued OID requests at each layer of the stack. What has perhaps gone wrong is that one OID request took an unreasonably long time to complete, and meanwhile hundreds of “quick” requests piled up behind it (e.g., OID_GEN_STATISTICS). The quick ones all complete synchronously, leading to the stack overflow.

Incidentally, the NDIS fix is not scheduled to go out with SP1, since we had never seen it affect real customers.


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Well perhaps. But I don’t want to tell everyone to run out and change their miniport to work around a theoretical bug in NDIS. It still seems more likely to me that there is something specific the filter driver is doing to make the situation much more likely. It’d be helpful to know repro details (how often does this repro, under what conditions [was the customer running heavy stress?]).

I’m still not convinced that this is really due to the theoretical NDIS bug, since so many filters and miniports have been fine for so many OS releases thus far. In this case, the bug is exhibited between the physical miniport and NDIS itself, so the IM filter above shouldn’t have any effect. But it was stated that the bug doesn’t repro without the IM filter present.

I guess I’m saying that if this only happened once, and only under huge stress, then maybe you can mollify the customer by saying that they were unlucky enough to trip over a rare issue in Windows. But if it happens twice or more, or happens on a largely-idle system, then you have to figure out why your passthru-derivative is broken when all the other passthru-derivatives are not.

From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Thomas F. Divine
Sent: Monday, November 29, 2010 1:01 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] NDIS6 - mutual recursion between miniport OidQueryHandler and NdisMOidRequestComplete leading to stack overflow.

So a reentrancy check in the old IM 5 driver logic might be helpful? The pAdapt->MPOutstandingRequests field exists, but is never checked. If it was “fancied up” to use interlocked functions it could at least detect this situation and perhaps blow off re-use of the re-entered request.

Thomas F. Divine
http://www.pcausa.com

From: Jeffrey Tippetmailto:xxxxx
Sent: Monday, November 29, 2010 3:37 PM
To: Windows System Software Devs Interest Listmailto:xxxxx
Subject: RE: [ntdev] NDIS6 - mutual recursion between miniport OidQueryHandler and NdisMOidRequestComplete leading to stack overflow.

The NDIS team has reproduced this issue on Windows 7 (aka Windows Server 2008 R2) when the machine was under artificially heavy stress. In that case, it is indeed a theoretical design bug in NDIS, since we’re pumping the OID queue in the context of the completion routine. But I’ve not heard of this repro’ing in a sane workload (e.g., at a customer site). So you might want to inspect your code to make sure that you’re not making the problem worse. In particular, check for OID requests that might take “too long” to complete in your filter, and check that you don’t have any loop that could issue 50 OID requests in very quick succession.

You may find !ndiskd.oid helpful – it lists all queued OID requests at each layer of the stack. What has perhaps gone wrong is that one OID request took an unreasonably long time to complete, and meanwhile hundreds of “quick” requests piled up behind it (e.g., OID_GEN_STATISTICS). The quick ones all complete synchronously, leading to the stack overflow.

Incidentally, the NDIS fix is not scheduled to go out with SP1, since we had never seen it affect real customers.


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer</mailto:xxxxx></mailto:xxxxx>

Well, sure. I understand that it is apparently rare.

OTOH, other environments (non-MS VMs) may somehow make the theoretical bug more likely under less stressful conditions.

Not really my problem. Just making suggestions. I haven’t seen this myself either.

Thomas F. Divine

From: Jeffrey Tippet
Sent: Monday, November 29, 2010 4:31 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] NDIS6 - mutual recursion between miniport OidQueryHandler and NdisMOidRequestComplete leading to stack overflow.

Well perhaps. But I don’t want to tell everyone to run out and change their miniport to work around a theoretical bug in NDIS. It still seems more likely to me that there is something specific the filter driver is doing to make the situation much more likely. It’d be helpful to know repro details (how often does this repro, under what conditions [was the customer running heavy stress?]).

I’m still not convinced that this is really due to the theoretical NDIS bug, since so many filters and miniports have been fine for so many OS releases thus far. In this case, the bug is exhibited between the physical miniport and NDIS itself, so the IM filter above shouldn’t have any effect. But it was stated that the bug doesn’t repro without the IM filter present.

I guess I’m saying that if this only happened once, and only under huge stress, then maybe you can mollify the customer by saying that they were unlucky enough to trip over a rare issue in Windows. But if it happens twice or more, or happens on a largely-idle system, then you have to figure out why your passthru-derivative is broken when all the other passthru-derivatives are not.

From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Thomas F. Divine
Sent: Monday, November 29, 2010 1:01 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] NDIS6 - mutual recursion between miniport OidQueryHandler and NdisMOidRequestComplete leading to stack overflow.

So a reentrancy check in the old IM 5 driver logic might be helpful? The pAdapt->MPOutstandingRequests field exists, but is never checked. If it was “fancied up” to use interlocked functions it could at least detect this situation and perhaps blow off re-use of the re-entered request.

Thomas F. Divine

http://www.pcausa.com

From: Jeffrey Tippet

Sent: Monday, November 29, 2010 3:37 PM

To: Windows System Software Devs Interest List

Subject: RE: [ntdev] NDIS6 - mutual recursion between miniport OidQueryHandler and NdisMOidRequestComplete leading to stack overflow.

The NDIS team has reproduced this issue on Windows 7 (aka Windows Server 2008 R2) when the machine was under *artificially heavy stress*. In that case, it is indeed a theoretical design bug in NDIS, since we’re pumping the OID queue in the context of the completion routine. But I’ve not heard of this repro’ing in a sane workload (e.g., at a customer site). So you might want to inspect your code to make sure that you’re not making the problem worse. In particular, check for OID requests that might take “too long” to complete in your filter, and check that you don’t have any loop that could issue 50 OID requests in very quick succession.

You may find !ndiskd.oid helpful – it lists all queued OID requests at each layer of the stack. What has perhaps gone wrong is that one OID request took an unreasonably long time to complete, and meanwhile hundreds of “quick” requests piled up behind it (e.g., OID_GEN_STATISTICS). The quick ones all complete synchronously, leading to the stack overflow.

Incidentally, the NDIS fix is not scheduled to go out with SP1, since we had never seen it affect real customers.


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Hi Jeffrey

Many thanks for your replies here. The problem reproduces frequently for one
customer, but ‘under what conditions’ is difficult for me to determine.

I had used !ndiskd.oid against the crash dump, but it shows quite little.
There shows one OID_GEN_XMIT_ERROR against the IM, and OID_GEN_STATISTICS
against each of the LWF, and nothing against the miniport.

The IM doesn’t have any loops which issue oid requests. The miniport request
handler and protocol request completion are very similar to passthru (wink),
I cant see anything there.

I am wondering if what we have here is either some how the presence of the
IM is provoking some other thing to snow us with OID requests, and if so how
I would discover the “other thing”.

Best Wishes,
Lyndon

“Jeffrey Tippet” wrote in message
news:xxxxx@ntdev…

Well perhaps. But I don’t want to tell everyone to run out and change their
miniport to work around a theoretical bug in NDIS. It still seems more
likely to me that there is something specific the filter driver is doing to
make the situation much more likely. It’d be helpful to know repro details
(how often does this repro, under what conditions [was the customer running
heavy stress?]).

I’m still not convinced that this is really due to the theoretical NDIS bug,
since so many filters and miniports have been fine for so many OS releases
thus far. In this case, the bug is exhibited between the physical miniport
and NDIS itself, so the IM filter above shouldn’t have any effect. But it
was stated that the bug doesn’t repro without the IM filter present.

I guess I’m saying that if this only happened once, and only under huge
stress, then maybe you can mollify the customer by saying that they were
unlucky enough to trip over a rare issue in Windows. But if it happens
twice or more, or happens on a largely-idle system, then you have to figure
out why your passthru-derivative is broken when all the other
passthru-derivatives are not.

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Thomas F. Divine
Sent: Monday, November 29, 2010 1:01 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] NDIS6 - mutual recursion between miniport
OidQueryHandler and NdisMOidRequestComplete leading to stack overflow.

So a reentrancy check in the old IM 5 driver logic might be helpful? The
pAdapt->MPOutstandingRequests field exists, but is never checked. If it was
“fancied up” to use interlocked functions it could at least detect this
situation and perhaps blow off re-use of the re-entered request.

Thomas F. Divine

http://www.pcausa.com

From: Jeffrey Tippet

Sent: Monday, November 29, 2010 3:37 PM

To: Windows System Software Devs Interest List

Subject: RE: [ntdev] NDIS6 - mutual recursion between miniport
OidQueryHandler and NdisMOidRequestComplete leading to stack overflow.

The NDIS team has reproduced this issue on Windows 7 (aka Windows Server
2008 R2) when the machine was under artificially heavy stress. In that
case, it is indeed a theoretical design bug in NDIS, since we’re pumping the
OID queue in the context of the completion routine. But I’ve not heard of
this repro’ing in a sane workload (e.g., at a customer site). So you might
want to inspect your code to make sure that you’re not making the problem
worse. In particular, check for OID requests that might take “too long” to
complete in your filter, and check that you don’t have any loop that could
issue 50 OID requests in very quick succession.

You may find !ndiskd.oid helpful – it lists all queued OID requests at each
layer of the stack. What has perhaps gone wrong is that one OID request
took an unreasonably long time to complete, and meanwhile hundreds of
“quick” requests piled up behind it (e.g., OID_GEN_STATISTICS). The quick
ones all complete synchronously, leading to the stack overflow.

Incidentally, the NDIS fix is not scheduled to go out with SP1, since we had
never seen it affect real customers.


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

That’s interesting. If !ndiskd.oid doesn’t see anything on the (physical) miniport, then there are no OIDs queued on that miniport. Yet NDIS keeps issuing new requests from within NDIS!ndisMOidRequestCompleteInternal. It only does this if there *is* some request queued. In fact, both !ndiskd.oid and NDIS!ndisMOidRequestCompleteInternal are checking the same linked list in NDIS’s internal data structure.

Now perhaps the protocol (in this case, your IM’s protocol edge) is queuing a new OID request from within its OID completion handler. That could explain how NDIS always seems to find a new request on the queue, even though the queue is empty at the time !ndiskd looked at it. Although it seems unlikely that you do any significant amount of work in PtRequestComplete handler, especially for a statistical OID.

Hi Jeffrey

Important to bear in mind that this is debug from crash dump, so whereas
there might be no oid queued at the moment, this doesnt mean there were no
oid queued a few moments earlier.

I have no code in PtRequestComplete path which issues oid requests. You are
right, there is no work done in my PtRequestComplete for a statistical OID.

Thanks again,
Lyndon

“Jeffrey Tippet” wrote in message
news:xxxxx@ntdev…

That’s interesting. If !ndiskd.oid doesn’t see anything on the (physical)
miniport, then there are no OIDs queued on that miniport. Yet NDIS keeps
issuing new requests from within NDIS!ndisMOidRequestCompleteInternal. It
only does this if there is some request queued. In fact, both !ndiskd.oid
and NDIS!ndisMOidRequestCompleteInternal are checking the same linked list
in NDIS’s internal data structure.

Now perhaps the protocol (in this case, your IM’s protocol edge) is queuing
a new OID request from within its OID completion handler. That could
explain how NDIS always seems to find a new request on the queue, even
though the queue is empty at the time !ndiskd looked at it. Although it
seems unlikely that you do any significant amount of work in
PtRequestComplete handler, especially for a statistical OID.