Is it possible to NdisSend() too fast?

Hello all, long time lurker who greatly appreciates the help that comes up
from this list.

I have a simple question. I have an NDIS IM driver based on the James A.
example code. In it I look at messages and occasionally send a response. I
have Wireshark installed on both systems (requester and responder) and am
watching the traffic on both. What’s happening now is that my requester will
send one packet, seen by both sides, and then my responder will send back two
packets. On the responder itself, Wireshark says it sees both packets, that
their IP/UDP checksums are fine, etc. But on the requester side, I see a
non-deterministic number of packets: sometimes 0, 1, or 2. The two machines
are both on the same switch (I don’t know what kind, unfortunately) and are
on the same /24 subnet.

So my question is: is it possible that NdisSend is sending packets too
quickly and it’s sometimes causing a collision or something? (They show up
in Wireshark on the responder as being 42 microseconds apart, but still,
that seems fine to me.) Or is it possible that it’s just a faulty switch?

Billy Volta Esq.

Also, I thought I had disproved this, but now I have to say that I have strong
evidence that the packet which isn’t seen on the requester side is a packet
for which NdisSend returned NDIS_STATUS_SUCCESS rather than
NDIS_STATUS_PENDING. (I just had a request which sends back 3 ASCII
messages in a known order, and the DbgPrints indicated that the order was
PENDING/SUCCESS/PENDING; although all 3 showed up in Wireshark on the
sender, the middle one was missing on the receiver.)

Billy Volta Esq.


What do you mean by “visible on the requester side”??? Do you mean invocation of your protocol’s PtSendComplete() handler?

A quick look into NDIS.SYS will reveal that NdisSend() is just a macro - when you call NdisSend() you call one of the NDIS library’s functions that forwards the packet to the underlying miniport driver and returns whatever status the miniport’s send() handler does. If the miniport driver’s send() handler returns NDIS_STATUS_PENDING, it means that it is going to indicate send completion at some point; when that happens, your protocol’s PtSendComplete() handler will get invoked by NDIS so that you find out what the actual completion status is. However, if it returns any other status, the request was completed synchronously, so its completion is not going to be indicated separately. Therefore, if you judge “packet visibility on the sender side” by PtSendComplete() handler invocations, such a packet will appear to be missing.
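
In sketch form, the protocol-side handling looks something like this (a minimal NDIS 5.x fragment; pAdapt and FreeMySendResources are made-up names, not anything from the IM sample):

    NDIS_STATUS status;

    NdisSend(&status, pAdapt->BindingHandle, pPacket);

    if (status != NDIS_STATUS_PENDING) {
        /* Completed synchronously: PtSendComplete() will NOT be called
         * for this packet, so reclaim its resources right here.  Note
         * that NDIS_STATUS_SUCCESS still says nothing about the packet
         * actually reaching the wire. */
        FreeMySendResources(pAdapt, pPacket, status);
    }
    /* else: NDIS will invoke PtSendComplete() later with the final status. */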

In other words, everything seems to be working the way it is meant to work…

Anton Bassov

For sure it is possible to “send too fast”.

Getting any return code from NdisSend is meaningless. It’s a “lying send”. Ethernet (and the miniport driver/hardware) are not “reliable”. Getting a SUCCESS from a call to NdisSend doesn’t mean that the packet even left the host. Failure may be returned if there are resource problems or hardware failures - but I don’t actually recall ever seeing a failure from NdisSend.

In addition, Wireshark is not God either. Wireshark on the sender side will probably show all packets, because on the local host the “sent” packets are looped back to Wireshark by NDIS software; no loss there. Wireshark on the receiver, or on a third-node tap, provides a more realistic view of the packets.

Yep, one (or more) was certainly lost.

You can get the same packet loss in user-mode using UDP. Just send a bunch of UDP packets with a counter or such in the payload. The receiver will miss some (possibly a LOT).
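
For instance, here is a minimal user-mode sketch (Winsock; the port and receiver address are made up for illustration). Every sendto() below will “succeed”, yet a receiver counting the sequence numbers will typically find gaps:

    /* udpblast.c - send UDP datagrams as fast as possible. */
    #include <winsock2.h>
    #pragma comment(lib, "ws2_32.lib")

    int main(void)
    {
        WSADATA wsa;
        SOCKET s;
        struct sockaddr_in dst;
        unsigned long seq;

        WSAStartup(MAKEWORD(2, 2), &wsa);
        s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

        dst.sin_family = AF_INET;
        dst.sin_port = htons(9999);                     /* arbitrary test port */
        dst.sin_addr.s_addr = inet_addr("192.168.1.2"); /* the receiver */

        /* Each datagram carries a sequence number in its payload. */
        for (seq = 0; seq < 100000; seq++)
            sendto(s, (char *)&seq, sizeof(seq), 0,
                   (struct sockaddr *)&dst, sizeof(dst));

        closesocket(s);
        WSACleanup();
        return 0;
    }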

So packet loss when “sending too fast” is not specifically related to sending from within your driver.

Good luck,

Thomas F. Divine
http://www.pcausa.com


> Getting a SUCCESS from a call to NdisSend doesn’t mean that the packet even left the host.

Oops, Calvin is not going to be happy at all when he reads it - IIRC, once he said that anyone who reports successful completion before the packet went on the wire should be sacked on the spot…

> Yep, one (or more) was certainly lost.

If I understood the OP properly, what he actually says here is that the sender is unable to see all the packets that it had sent while the receiver got them all, which made me wonder about the very way the problem is presented. The very first thing that gets into my head is that “visibility on the sender side” in this context means invocations of PtSendComplete(), because, judging from his description, packets seem to get across the wire just fine…

Anton Bassov

Calvin has the right approach, but I still say that most calls to NdisSend are “lying sends” in reality.

Oops, you are right that I may have misread the OP. Or the behavior I described happens on the receiver side: perhaps Calvin’s adapter is the sender, but a not-so-good HW/driver sits on the receiver/responder side.

Of course, using a crossover cable would eliminate the question about the switch.

Good night to you, Anton!

Thomas

No, I’m not unhappy, Anton. Mr. Divine is right on “lying send”.

When a miniport returns NDIS_STATUS_SUCCESS on a send, it usually means the
chip has finished DMA’ing the data described by the MDL/SGL to its TX FIFO. By
returning NDIS_STATUS_SUCCESS, it tells the caller “I don’t need the MDL
anymore; feel free to nuke it or do whatever you want with it and the memory
associated with it”. A miniport that returned success prematurely would let
the caller free memory the hardware is still DMA’ing from, hence causing
nasty data corruption.

Returning success, of course, does not guarantee that the packet has made it
to the wire or been received by the remote end at that point. The packet may be:

  1. stuck in the internal TX FIFO because it is Xoff’ed by the receiver on a
    full-duplex link;
  2. dropped by the PHY if a collision is detected on a half-duplex link;
  3. dropped by a switch along the path, or by the receiving end, for any
    reason, e.g. its internal RX FIFO running extremely low.

Note that the receiver could have plenty of host RX buffers but could still
lose packets if it has a shallow RX FIFO, an inefficient DMA write engine, or
is simply running short of PCIe credits to DMA. And this is by no means a
complete list of what can happen.
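
To illustrate that contract from the miniport side, here is a rough sketch only - a serialized NDIS 5.x miniport with made-up QueueForDma/ADAPTER names, not any particular driver:

    /* MiniportSend handler: hand the packet to the DMA engine and pend. */
    NDIS_STATUS
    MPSend(NDIS_HANDLE MiniportAdapterContext, PNDIS_PACKET Packet, UINT Flags)
    {
        PADAPTER adapter = (PADAPTER)MiniportAdapterContext;

        if (!QueueForDma(adapter, Packet))      /* TX ring full */
            return NDIS_STATUS_RESOURCES;

        return NDIS_STATUS_PENDING;             /* complete later */
    }

    /* TX-done DPC: the chip has finished DMA'ing the buffers, so the
     * memory may be reclaimed - even though the frame may still sit in
     * the TX FIFO or get dropped further down the path. */
    VOID
    MPTxDoneDpc(PADAPTER adapter, PNDIS_PACKET Packet)
    {
        NdisMSendComplete(adapter->MiniportHandle, Packet, NDIS_STATUS_SUCCESS);
    }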

Calvin


> Calvin has the right approach, but I still say that most calls to NdisSend are “lying sends” in reality.

I agree with Calvin; driver developers should assume packet send completion
from the hardware simply means the controller is done DMA’ing your buffer.
There are LOTS and LOTS of places for packets to get lost before you get a
packet-received indication at the other end.

A related question is: what does packet send completion for a virtual NIC in
a VM actually mean?

I’m really curious what happens on Hyper-V (and other virtual systems’)
virtual NICs when the physical NIC can’t transmit as fast as all the VMs
trying to send packets. Do you get head-of-line blocking in the virtual
switch on ALL the virtual NICs for that switch, and QoS badly degrades? Or
does the virtual switch start dropping packets on some virtual NICs,
attempting to prevent every virtual NIC from blocking? Or can the virtual
switch somehow throttle the send packet flow from virtual NICs such that
packets are never dropped and everybody gets fair access to the physical NIC
(which seems difficult, since physical NICs often just queue send packets,
and the sender has no idea how full these queues are)?

Some NICs (lots of NICs?) queue packets in software when there is no space
in the hardware queues, potentially allowing very large queue depths, which
is very bad for latency, and that causes the RTT to climb and TCP throughput
to fall. I don’t actually know the tradeoffs on TCP throughput of dropping
packets vs. huge queues. Some documentation on (desired) vmswitch behavior
would be helpful so we understand the interaction between NIC drivers and
the virtual switch. Are there things drivers could do to better cooperate
with vmswitch?

Really, packet flow control needs to happen at the connection/flow level, not
the adapter/packet level, which is a good argument in favor of TOE adapters
(which, at least for TCP connections, can know the internal queue levels and
apply round-robin or bandwidth limits per flow). TOE can also send TCP ACKs
at higher priority and without the latency of interrupting the host,
reducing the RTT. Efficiently dealing with network flow control and
congestion is just a huge, sticky area (and at the core of the whole net
neutrality debate).

Jan

The only thing that a driver [developer] can ‘assume’ (or be guaranteed)
from an NDIS send completing is that the resources have been returned.

Period.

Cheers,
Dave Cattley

> A related question is: what does packet send completion for a virtual NIC in a VM actually mean?

For my Xen drivers, I give the ‘back end’ a reference to the physical
memory pages, so I only SendComplete when the ‘back end’ says it’s
finished with the memory. From there, the packet could get queued up:
. On the physical network adapter
. Waiting for space on a physical network adapter
. On another virtual network adapter
. Waiting for space on a virtual network adapter

The back end driver is considered ‘trusted’, so it is safe to give it a
page of memory that might have data on it other than packet buffers.
Because that ‘other data’ might contain sensitive information, and because
it would be devastating if it were written to, packet data going to
another virtual machine is always copied first. This also means that I get
my buffer pages back pretty quickly.

The current Xen shared communication ring is a bit undersized for really
fast communications (the new high-performance implementation has just hit
RC status, I believe). In testing between my Windows VM and the ‘back end’
(i.e. no physical adapter involved), it peaks at about 5 Gbit/s under ideal
conditions on my test boxes (very low-end HP servers). Quite a few packets
get dropped along the way because the buffers fill up between runs of my
RX DPCs. It’s plenty fast enough to saturate a gigabit physical adapter,
though.

James

You’re asking a bunch of questions here. I could answer most of them from
the perspective of the current Hyper-V release or the ideal virtualization
system. Practice never quite catches up with theory, and in this case,
we’ve got quite a long way to go. I’ll assume that you’re primarily curious
about the current Hyper-V release (Windows Server 2008 R2 - for the
archives) simply because that’s the only production-class virtualization
system today that uses Windows drivers in anything but the guest VM.

The Hyper-V VM Switch is a layer-2 switch. The physical NIC is essentially
the uplink port. Packets are only sent to it when the switch has already
decided that those packets deserve to make it onto the wire, by one
criterion or another.

We have experimented with various ways of managing fairness and flow control
between guest VMs and the physical port. Some were pretty high in overhead
and gave us performance characteristics that we didn’t like. Furthermore,
the current version of Hyper-V makes no attempt to strongly manage quality
of service, and it’s not even possible to configure a VM NIC or a switch port
with a QoS statement. So we’ve found that simple mechanisms that generate
some approximation of fairness, but which are very cheap to implement, are
currently the best fit.

To wit, each VM NIC has a corresponding buffer in the guest VM. This
buffer is visible to the VM switch to which the VM NIC is attached. The
guest OS can queue up as many packets in that buffer (where larger packets
are essentially described by MDLs, not copied) as fit within it. The VM NIC
must stop queuing when the buffer is full, as there’s nowhere else to put
anything.

The VM switch rotates among the VM NICs looking for work to do, pulling
large lists of packets off each one in turn. Control structures and
Ethernet headers are copied through the VM switch’s internal buffers, which
are finite. It’s entirely possible that it will drop packets in certain
conditions.
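
If it helps to picture it, here is a toy model of that rotation (purely an illustration of the scheme, not vmswitch code - it compiles and runs as ordinary user-mode C):

    #include <stdio.h>

    #define NICS  3
    #define BATCH 4

    int main(void)
    {
        int queued[NICS] = { 10, 3, 7 };   /* packets waiting per VM NIC */
        int cur = 0, remaining = 10 + 3 + 7;

        while (remaining > 0) {
            /* Pull up to one batch from the current VM NIC, then rotate. */
            int take = queued[cur] < BATCH ? queued[cur] : BATCH;
            queued[cur] -= take;
            remaining   -= take;
            if (take > 0)
                printf("pulled %d packet(s) from VM NIC %d\n", take, cur);
            cur = (cur + 1) % NICS;        /* rotation ~= cheap, approximate fairness */
        }
        return 0;
    }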

The VM switch indicates packets to the physical NIC using MDL chains that
describe headers in the VM switch and higher-layer protocols and data in the
VM’s shared buffer. (Again, whether the higher-layer stuff is just copied
or described by an MDL depends on how big it is. The exact size is something
we will continue to tune.) These structures aren’t freed until the NIC
returns the net buffers back up the stack. (The VM can mess with any part
of it that it can see, which is why the layer-2 stuff must come from buffers
that that VM cannot see.)

I don’t think that there’s much you can do in your NICs and miniports to
help the VM switch beyond what Ali Dabagh has described at various WinHECs
and DDCs. Offloading performance counters is really nice. Header/Data
split is key. VMQ is very useful. I suspect that some of you will be
tempted to buffer up packets in software so that the VM switch won’t drop
packets as soon as it would otherwise. While you could certainly do this,
it would probably just tie up memory that only makes a difference for the
brief moment between when the VM switch’s buffers overflow and when yours
do.

Most of what you say is true. Useful traffic shaping and flow
management is hard. Perhaps we’ll take that on in the next release.


Jake Oshins
Hyper-V I/O Architect
Windows Kernel Group

This post implies no warranties and confers no rights.



> The Hyper-V VM Switch is a layer-2 switch. The physical NIC is essentially the uplink port. Packets are only sent to it when the switch has already decided that those packets deserve to make it onto the wire, by one criterion or another.

Just curious… the layer-2 switch equivalent under Linux can receive
packets from a VM that are ‘large’ (e.g. 60 KB) and ‘unchecksummed’, as
in LSO/TSO/GSO and checksum offloading. If the L2 routing sends the
packet to a physical adapter and the physical adapter supports the
offloading, then it handles it. If the L2 routing sends the packet to
another VM and that VM reports that it supports offloading, then it can
accept the packet in its ‘large’ and ‘unchecksummed’ state too, saving
breaking the packet up into MSS-sized chunks and calculating checksums
on it.

Does Hyper-V do anything similar?

This works really, really well under Linux - one ETH/IP/TCP header per
60 KB of data instead of one per 1.5 KB of data, and no checksum
calculation required.

Under Windows, sending ‘large’ and ‘unchecksummed’ packets to the Xen
backend switch works really well, but I just can’t get Windows to
receive such packets. It refuses to accept a packet larger than the MTU,
so I have to break it up into MSS+header-sized chunks myself, and it
also refuses to believe me when I tell it that the packet checksum is
already validated - it goes and checks it itself anyway, so I have to
calculate it too. It kind of sucks. Maybe it’s not your area, but is
this the way NDIS is supposed to work?
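
For the checksum half, the only NDIS 5.x mechanism I’m aware of is the per-packet checksum info - a sketch, assuming checksum offload has been negotiated for the miniport via OID_TCP_TASK_OFFLOAD (which may well be the part that isn’t happening in my case):

    /* Mark a received packet's IP/TCP checksums as already validated so
     * that tcpip.sys can skip re-checking them.  Only honored if checksum
     * offload was negotiated for this miniport. */
    NDIS_TCP_IP_CHECKSUM_PACKET_INFO csumInfo;

    csumInfo.Value = 0;
    csumInfo.Receive.NdisPacketIpChecksumSucceeded  = TRUE;
    csumInfo.Receive.NdisPacketTcpChecksumSucceeded = TRUE;

    NDIS_PER_PACKET_INFO_FROM_PACKET(Packet, TcpIpChecksumPacketInfo) =
        (PVOID)(ULONG_PTR)csumInfo.Value;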

Thanks

James

How about VMQ lookahead split? Is it more of a performance feature or a
security feature?

Thanks,
Calvin

> the layer-2 switch equivalent under Linux

IIRC, as recently as two days ago you were complaining about SNR on this list (which, of course, was blamed on me)…

Anton Bassov

> By returning NDIS_STATUS_SUCCESS, it tells the caller “I don’t need the MDL anymore; feel free to nuke it or do whatever you want with it and the memory associated with it”.

As far as I recall our discussion, this is EXACTLY what I was saying - the only thing NDIS_STATUS_SUCCESS implies is that the packet’s resources are not needed by the miniport driver any more, which cannot be taken as an indication that the packet ACTUALLY went on the wire; and you were claiming that reporting a packet as sent before it reaches the wire is “professional malpractice”. This is why I immediately recalled our discussion (with a certain degree of irony, of course) when I saw Thomas’s statement…

Anton Bassov

> > the layer-2 switch equivalent under Linux
>
> IIRC, as recently as two days ago you were complaining about SNR on this list.

Apologies to anyone who thought it was a complaint - it was just an
observation. I wouldn’t have this list any other way.

I brought up the Linux bridging code because that was what I understood,
and wanted to know if Windows worked in a way that was at all similar
and if it had similar features.

> (which, of course, was blamed on me)…

If someone mentioned your name, I missed it. In any case it takes at
least two people to sustain the sort of arguments that have arisen on
this list in days gone by…

James

On Tue, Jan 19, 2010 at 8:20 PM, Anton Bassov wrote:

> If I understood the OP properly, what he actually says here is that the sender is unable to see all the packets that it had sent while the receiver got them all…

You got it backwards, Anton. I had said “On the responder itself, Wireshark
says it sees both packets… But on the requester side, I see a
non-deterministic number of packets”.

So if it is the case, as Thomas says, that “on the local host the ‘sent’
packets are looped back to Wireshark by NDIS software”, then that’s why the
responder is seeing all the packets, while the requester, being on the same
switch, sees that not all of the packets were actually sent.

I’m probably just going to add a slight delay to the sending of packets and
see if that helps anything. Thanks to everyone for the help, and for the
interesting conversation that ensued. I always like it when the details get
discussed. :)

Billy Volta Esq.