NDIS driver leaking packets in a strange pattern over time

I have a user of my drivers reporting a situation where it runs for a
while (days) and then the network stops working. This only appears to
occur for this one user, but they tell me they are moving quite a large
amount of data, so they are probably not a typical user. I have sent
them another update that increments a counter when indicating to NDIS
and decrements it when NDIS calls my ReturnPacket, and the only place I
can see where the leak might occur is NDIS never returning my packets
to me.
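
A minimal sketch of that bookkeeping, with hypothetical names rather
than the actual driver's:

    #include <ndis.h>

    static LONG OutstandingRx;  /* per-adapter in practice; a single global here for brevity */

    /* Receive path: count the packet, then hand it to NDIS. */
    static VOID IndicateRxPacket(NDIS_HANDLE MiniportHandle, PNDIS_PACKET Packet)
    {
        NdisInterlockedIncrement(&OutstandingRx);
        NdisMIndicateReceivePacket(MiniportHandle, &Packet, 1);
    }

    /* MiniportReturnPacket handler: NDIS hands the packet back here. */
    VOID MyReturnPacket(NDIS_HANDLE MiniportAdapterContext, PNDIS_PACKET Packet)
    {
        UNREFERENCED_PARAMETER(MiniportAdapterContext);
        NdisInterlockedDecrement(&OutstandingRx);
        /* ... recycle Packet and its NDIS_BUFFERs ... */
    }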

I’ve enabled the checked version of NDIS.sys and turned on the verifier,
but nothing out of the ordinary occurs. At this time I am counting
outstanding packets between Allocate and Free, and the failure goes like
this (the first number is the number of seconds since system boot):

10033: 4 packets are outstanding
10333: 6 packets are outstanding
10633: 9 packets are outstanding
10933: 13 packets are outstanding
11233: 17 packets are outstanding

13233: 106 packets are outstanding
13275: 41 packets are outstanding

The pattern continues with gradual increases in outstanding packets
(seemingly never by 1, always by 2-4), then a sudden drop, then a slow
increase again, until eventually it hits 1024 packets and
NdisAllocatePacket fails.

It wasn’t until I formatted the above list that I noticed that the
packets start leaking at more or less 10000 seconds after boot (booting
etc. might account for the 33 seconds before the first message in the
driver logs, which is where I start counting from), and then exactly
every 300 seconds after that. I think the timing starts to drift a bit
after a while, but it can’t be a coincidence, can it?

Can anyone offer any suggestions as to where I could look? I only have
the one logfile so I can’t tell if starting at 10000 seconds is a fluke,
but the 5 minute interval can’t be just chance.

Thanks

James

What happens if you abort the activity after, say, the apparent leak
hits 700+ packets (a recognizable number just pulled from thin air) and
watch whether the system remains static or returns the packets over
time?

Basically I am wondering if this is an issue of leakage or congestion
(congestion in the resource/packet recovery process).

And is there a threshold of traffic rate below which this issue does not
occur? Have you tried to reproduce this with traffic tools like iPerf or
some such?

Good Luck,
Dave Cattley


To find out what the packets are, link the outstanding packets in a
list, and do a hex dump of data (or headers). You can also timestamp
them and dump them when they have been pending for too long.

They could be ICMP packets pending reassembly, or something like that.
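
For illustration, a minimal sketch of that kind of tracking: a list of
small per-packet records carrying an indication timestamp, walked
periodically to dump anything outstanding for too long. All names here
are hypothetical, the list must be protected by the adapter's receive
lock, and where the record itself is stored is a separate question
(picked up later in the thread).

    #include <ndis.h>

    /* Hypothetical per-packet tracking record, one per outstanding receive. */
    typedef struct _RX_TRACK {
        LIST_ENTRY    ListEntry;
        PNDIS_PACKET  Packet;
        ULONGLONG     IndicateTime;     /* KeQueryInterruptTime() at indication */
    } RX_TRACK, *PRX_TRACK;

    /* Call when indicating a packet (receive lock held). */
    VOID TrackIndicatedPacket(PLIST_ENTRY OutstandingList, PRX_TRACK Track,
                              PNDIS_PACKET Packet)
    {
        Track->Packet = Packet;
        Track->IndicateTime = KeQueryInterruptTime();
        InsertTailList(OutstandingList, &Track->ListEntry);
    }

    /* Call from a periodic timer DPC (receive lock held): dump the Ethernet
       header of anything outstanding longer than AgeLimit (100ns units). */
    VOID DumpStalePackets(PLIST_ENTRY OutstandingList, ULONGLONG AgeLimit)
    {
        ULONGLONG now = KeQueryInterruptTime();
        PLIST_ENTRY entry;

        for (entry = OutstandingList->Flink; entry != OutstandingList;
             entry = entry->Flink) {
            PRX_TRACK track = CONTAINING_RECORD(entry, RX_TRACK, ListEntry);
            PNDIS_BUFFER buffer;
            PVOID va = NULL;
            UINT bufLen = 0, totalLen = 0;

            if (now - track->IndicateTime < AgeLimit)
                continue;
            NdisQueryPacket(track->Packet, NULL, NULL, &buffer, &totalLen);
            if (buffer != NULL)
                NdisQueryBufferSafe(buffer, &va, &bufLen, NormalPagePriority);
            if (va != NULL && bufLen >= 14) {
                PUCHAR p = va;
                DbgPrint("stale packet %p, len %u, ethertype %02x%02x\n",
                         track->Packet, totalLen, p[12], p[13]);
            }
        }
    }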

> To find out what the packets are, link the outstanding packets in a
> list, and do a hex dump of data (or headers). You can also timestamp
> them and dump them when they have been pending for too long.
>
> They could be ICMP packets pending reassembly, or something like that.

I’m already using some of MiniportReservedEx… I think I would need a
doubly linked list to be able to implement your idea. It could show up
the problem pretty conclusively if the packets are all the same type,
and maybe allow me to reproduce it which would make testing easier, so
I’ll have a go once the user gets back to me with the latest logfile
dump.

James

> What happens if you abort the activity after, say, the apparent leak
> hits 700+ packets (a recognizable number just pulled from thin air)
> and watch whether the system remains static or returns the packets
> over time?
>
> Basically I am wondering if this is an issue of leakage or congestion
> (congestion in the resource/packet recovery process).

Yes that’s an excellent idea. I’ll put that to the user reporting the
problem.

> And is there a threshold of traffic rate below which this issue does
> not occur? Have you tried to reproduce this with traffic tools like
> iPerf or some such?

I’ve tried with iperf and can’t get a failure, but I don’t know that
I’ve ever run it for 10000 seconds (~2.75 hours I think). I’d better get
started :). I’ve also tried isic/udpsic/tcpsic, which throw packets
with random combinations of IP versions, TCP options (valid and
invalid), etc.

Thanks

James

As the ‘allocator’ of the packet (NDIS_PACKET) you may also use anything
after the minimum protocol reserved space in ProtocolReserved, so long
as you allocate it. In the receive path a miniport provides a packet
with PROTOCOL_RESERVED_SIZE_IN_PACKET guaranteed to be available to the
protocol. However, it is free to add even more ProtocolReserved space
and use it any way it wishes.

You can stash all of your packet tracking information in
NDIS_PACKET::ProtocolReserved[PROTOCOL_RESERVED_SIZE_IN_PACKET + SIZE_OF_WHATEVER_ELSE_YOU_NEED]

Just be sure to initialize your packet pools with the correct protocol
reserved size.
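
A minimal sketch of that layout, assuming the tracking record idea from
earlier in the thread (the record and all names are hypothetical):

    #include <ndis.h>

    /* Hypothetical tracking record, living after the protocol's guaranteed area. */
    typedef struct _RX_TRACK {
        LIST_ENTRY  ListEntry;
        ULONGLONG   IndicateTime;
    } RX_TRACK, *PRX_TRACK;

    /* The pool is created with the protocol's reserved area plus our private area. */
    #define RX_PROTOCOL_RESERVED_LENGTH \
        (PROTOCOL_RESERVED_SIZE_IN_PACKET + sizeof(RX_TRACK))

    /* Our private area starts right after the protocol's guaranteed space. */
    #define RX_TRACK_FROM_PACKET(_p) \
        ((PRX_TRACK)&(_p)->ProtocolReserved[PROTOCOL_RESERVED_SIZE_IN_PACKET])

    NDIS_STATUS CreateRxPacketPool(PNDIS_HANDLE PoolHandle, UINT Descriptors)
    {
        NDIS_STATUS status;

        NdisAllocatePacketPool(&status, PoolHandle, Descriptors,
                               RX_PROTOCOL_RESERVED_LENGTH);
        return status;
    }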

Now of course that will not be available in the Send path, but it does
not sound like you have a problem with your Miniport leaking packets :-)

Good Luck,
Dave Cattley


> What happens if you abort the activity after, say, the apparent leak
> hits 700+ packets (a recognizable number just pulled from thin air)
> and watch whether the system remains static or returns the packets
> over time?

I asked them to try this and it appears that the packets are returned
and the outstanding count drops to 0. The sample size of 1 is a bit
small to draw any concrete conclusions from, but it looks promising.

> Basically I am wondering if this is an issue of leakage or congestion
> (congestion in the resource/packet recovery process).

Me too now :-)

> And is there a threshold of traffic rate below which this issue does
> not occur? Have you tried to reproduce this with traffic tools like
> iPerf or some such?

I just cannot reproduce the problem no matter what I do. My packet use
does get excessive over a short amount of time but then drops down just
as quickly. I think the difference between my testing with iperf and the
customer’s scenario is that the customer is doing lots of processing on
the data, which may also result in disk I/O, while iperf simply discards
it.

How can I mitigate this congestion problem? Would it be sufficient to
simply drop some packets once the outstanding packet count reaches some
threshold?

Or maybe I’m doing too much processing in my DPC. I limit the number of
packets I indicate to NDIS to 32 at a time, but would often call
multiple indications per DPC iteration. It seems that Windows processes
my indicated packets synchronously up to a point, for the first indicate
call at least, so my DPC is probably running excessively long. I think I
should also put in an upper limit to the total number of packets I
indicate in a DPC, and if I hit that limit just reschedule the DPC
immediately and exit…

James

You might also consider when (based on the count of outstanding indications)
to start indicating with NDIS_STATUS_RESOURCES to force bound protocols to
relinquish the packet immediately. Maybe your driver is just too darn
generous with offering packets and it exacerbates the ‘cost’ of dealing with
packets by having long queues of packets somewhere in the system. I’m just
grasping at straws here but ‘unbounded’ generosity is hardly ever rewarded
as expected :-)
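
A rough sketch of that back-pressure idea, assuming a per-adapter
outstanding count and a tunable threshold; the structure and the
ReclaimRxPacket helper are hypothetical names, not the actual driver's:

    #include <ndis.h>

    typedef struct _ADAPTER {            /* trimmed, hypothetical */
        NDIS_HANDLE MiniportHandle;
        LONG        OutstandingRx;
        LONG        RxResourcesThreshold;
    } ADAPTER, *PADAPTER;

    VOID ReclaimRxPacket(PADAPTER Adapter, PNDIS_PACKET Packet);  /* recycle it right away */

    VOID IndicateOneRxPacket(PADAPTER Adapter, PNDIS_PACKET Packet)
    {
        if (Adapter->OutstandingRx >= Adapter->RxResourcesThreshold) {
            /* Bound protocols must copy the data before the call returns, so
               the packet is ours again immediately and MiniportReturnPacket
               will not be called for it. */
            NDIS_SET_PACKET_STATUS(Packet, NDIS_STATUS_RESOURCES);
            NdisMIndicateReceivePacket(Adapter->MiniportHandle, &Packet, 1);
            ReclaimRxPacket(Adapter, Packet);
        } else {
            NDIS_SET_PACKET_STATUS(Packet, NDIS_STATUS_SUCCESS);
            NdisInterlockedIncrement(&Adapter->OutstandingRx);
            /* decremented again in MiniportReturnPacket */
            NdisMIndicateReceivePacket(Adapter->MiniportHandle, &Packet, 1);
        }
    }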

Cheers,
Dave Cattley


James,

Is this NDIS 5.x on a pre-Vista/pre-S2008 OS? If so I may be able to share what we found with our NDIS Intermediate driver when there is high CPU/high disk I/O combined with heavy network traffic.

Larry C

> James,
>
> Is this NDIS 5.x on a pre-Vista/pre-S2008 OS? If so I may be able to
> share what we found with our NDIS Intermediate driver when there is
> high CPU/high disk I/O combined with heavy network traffic.

The user is using 2003 and my driver is NDIS 5.1, so any info you have
on that scenario would be useful!

I modified my driver to process a maximum of 32 packets or a total of
48KB of data (whichever is hit first) per DPC. If it hits that limit it
leaves the (virtual) adapter interrupt disabled and reschedules the DPC
at medium priority. I’m hoping this will give Windows enough time to
catch up between interrupts without sacrificing any performance - I
figure if Windows is already lagging behind then any performance hit is
already being realised anyway! I haven’t received any results of that
testing yet so I don’t know if that has made any difference.

I’ll probably make those two limits tunable.
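
Something along these lines, as a sketch only; ProcessOneRxPacket and
EnableVirtualInterrupt are hypothetical stand-ins for the ring handling,
not the actual driver code:

    #include <ndis.h>

    #define RX_DPC_MAX_PACKETS 32
    #define RX_DPC_MAX_BYTES   (48 * 1024)

    typedef struct _ADAPTER {            /* trimmed, hypothetical */
        KDPC RxDpc;
        /* ... ring pointers, counters, etc. ... */
    } ADAPTER, *PADAPTER;

    ULONG ProcessOneRxPacket(PADAPTER Adapter);      /* indicate one packet; 0 if ring empty */
    VOID  EnableVirtualInterrupt(PADAPTER Adapter);  /* unmask the virtual interrupt */

    VOID RxDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        PADAPTER adapter = Context;
        ULONG packets = 0, bytes = 0, len;

        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);

        while (packets < RX_DPC_MAX_PACKETS && bytes < RX_DPC_MAX_BYTES) {
            len = ProcessOneRxPacket(adapter);
            if (len == 0)
                break;                              /* nothing left on the ring */
            packets++;
            bytes += len;
        }

        if (packets >= RX_DPC_MAX_PACKETS || bytes >= RX_DPC_MAX_BYTES) {
            /* Budget exhausted: leave the interrupt masked and run again
               shortly so the rest of the stack gets a chance to drain. */
            KeSetImportanceDpc(&adapter->RxDpc, MediumImportance);
            KeInsertQueueDpc(&adapter->RxDpc, NULL, NULL);
        } else {
            EnableVirtualInterrupt(adapter);
        }
    }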

Thanks

James

> You might also consider when (based on the count of outstanding
> indications) to start indicating with NDIS_STATUS_RESOURCES to force
> bound protocols to relinquish the packet immediately. Maybe your
> driver is just too darn generous with offering packets and it
> exacerbates the ‘cost’ of dealing with packets by having long queues
> of packets somewhere in the system. I’m just grasping at straws here
> but ‘unbounded’ generosity is hardly ever rewarded as expected :-)

Sounds plausible :-)

My packet pool has an upper limit of 1024 packets, but that’s probably a
bit silly - it just raises the memory leak ceiling without actually
doing anything for performance - the worst I see is about 300 packets
outstanding during extreme testing (ping flood, udpsic, udpblast etc) so
512 might be a more sensible limit.

Thanks

James

Maybe ask for a sample of their network traffic using a tool like Wireshark… Perhaps looking at that will give more of a clue as to the type of traffic that you would need to reproduce the problem?

> Maybe ask for a sample of their network traffic using a tool like
> Wireshark… Perhaps looking at that will give more of a clue as to the
> type of traffic that you would need to reproduce the problem?

I’m pretty sure now that the problem is the volume of traffic, not the
type of traffic. Windows is just not processing the packets as fast as
my driver can indicate them, so eventually the queue of unprocessed
packets gets so long that there are none left on the freelist and the
network stops. It seems that when the user stops their network process,
Windows catches up again, so the packets aren’t truly leaked. I’ve also
had one report to the contrary though, so I don’t know what to think.

James

James,

The short story is too much disk I/O or high CPU will delay packet delivery between your driver and the application, causing packets to be buffered by NDIS/Winsock.

I can give you the dirty details but that probably should be done offline. We spent a lot of test time determining this.

Larry C

> James,
>
> The short story is too much disk I/O or high CPU will delay packet
> delivery between your driver and the application, causing packets to
> be buffered by NDIS/Winsock.
>
> I can give you the dirty details but that probably should be done
> offline. We spent a lot of test time determining this.

What I’m interested in is how to mitigate the problem :-)

Are there any tricks to avoid the situation, or is it just a matter of
detecting that Windows isn’t giving your packets back and dropping a few
until it catches up, or using NDIS_STATUS_RESOURCES as David suggested?

Thanks

James

Pre-Vista OSes use only CPU 0 for processing both disk I/O and network traffic, with disk I/O having priority. We saw as much as 30 seconds of delay between our Intermediate driver indicating a packet and an application actually receiving it during high disk usage.

If you have any physical memory hogs like SQL Server, make sure they are configured to use a maximum of one-half the total physical memory. Why? Because low physical memory means pagefile access, i.e. high disk usage.

Make sure your pagefile is not fragmented and is set to one and one-half times the size of physical memory from the start. Do not set it low and allow it to grow, because this may fragment the file. Delete the pagefile, then defrag the disk 3 times to clean it up before re-defining it.

I had increased our packet pool to 5120 packets and still ran out of packets sometimes. Setting NDIS_STATUS_RESOURCES seemed to help, but that did not cure the problem, just made the driver wait for NDIS to copy the packet.

Hope this helps.

Larry C

> Pre-Vista OSes use only CPU 0 for processing both disk I/O and network
> traffic, with disk I/O having priority. We saw as much as 30 seconds
> of delay between our Intermediate driver indicating a packet and an
> application actually receiving it during high disk usage.
>
> If you have any physical memory hogs like SQL Server, make sure they
> are configured to use a maximum of one-half the total physical memory.
> Why? Because low physical memory means pagefile access, i.e. high disk
> usage.
>
> Make sure your pagefile is not fragmented and is set to one and
> one-half times the size of physical memory from the start. Do not set
> it low and allow it to grow, because this may fragment the file.
> Delete the pagefile, then defrag the disk 3 times to clean it up
> before re-defining it.
>
> I had increased our packet pool to 5120 packets and still ran out of
> packets sometimes. Setting NDIS_STATUS_RESOURCES seemed to help, but
> that did not cure the problem, just made the driver wait for NDIS to
> copy the packet.
>
> Hope this helps.

That does help. Thanks.

I would have thought the latency between disk requests and responses
would have been sufficient for Windows to process any outstanding
packets. If any of the processing code was pageable, though, then I
guess that would cause it to stall.

Thanks again.

James

> Pre-Vista OSes use only CPU 0 for processing both disk I/O and network traffic, with disk I/O having priority.

Not necessarily.

This all depends on hardware interrupt routing, and, even in the Pentium III era, interrupts from disk and network hardware can be delivered to any CPU.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> > Pre-Vista OSes use only CPU 0 for processing both disk I/O and
> > network traffic, with disk I/O having priority.
>
> Not necessarily.
>
> This all depends on hardware interrupt routing, and, even in the
> Pentium III era, interrupts from disk and network hardware can be
> delivered to any CPU.

From what I’ve read, NDIS 5.1 always operates on CPU 0. RSS allows
proper handling across multiple CPUs, but that’s only in NDIS 6 and
above (or the mythical 5.2, which I’m going to pretend doesn’t exist).

I think part of the problem is that 2003 doesn’t have guaranteed support
for virtual storport drivers so I’m using scsiport, which means disk I/O
is processed at DIRQL, which trumps the DISPATCH/PASSIVE code in the
NDIS path. Either that or there is excessive swapping in the packet
handling path and NDIS is actually blocking on paging operations… I get
the feeling that the latter is more likely, as the user must have
phenomenal disk I/O for it to saturate the CPU.

James

James - I think 2k3 does actually have Virtual Storport support… Maybe not in the RTM version, but it showed up in a service pack, or some hotfix.