Using Winsock Kernel for high-performance datagram sockets

> But recall that the OP wrote earlier: the packet loss is on sending side.

Given that UDP is by definition unreliable, the NIC driver authors could trivially do this for their convenience.

IIRC NDIS 5 has a queue, and packets were silently discarded if the queue overflowed.

Are there enough stats available to determine if the packet was being discarded by NDIS from its internal queue, or by the miniport driver?

James

Hello,

Yes, our receivers are prepared for packet loss and for packets arriving out of order, and we are aware that each component may silently discard UDP packets. But I think there's a difference between a drop rate of fewer than 10 packets per day (as observed with the embedded devices) and 10 per second.

And for sure there will be some customers trying to use low-end switches, but most customers will follow our requirements and use Cisco etc., and we can also recommend certain NICs.

So what I'm asking for is some hints on getting as close as possible to the loss rate of the embedded devices when using the same network. To achieve this I switched to Winsock Kernel, which was a big improvement. But I'm probably not going to write a real hardware driver for a specific NIC.

As a first step I wanted to figure out where exactly the packets are lost, which, as far as I can tell, seems to be somewhere between my call to WskSendTo() and the packet leaving the PC on the wire. The next step is how to improve this bottleneck, perhaps by modifying certain buffer or QoS settings or properties in the NIC's parameters or in Winsock Kernel.
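(For context, the send path I'm measuring boils down to roughly the sketch below. This is illustrative only, not my actual driver code; names and the pool tag are placeholders, and it assumes the WSK datagram socket already exists and is bound.)

    #include <ntddk.h>
    #include <wsk.h>

    /* Completion context for one in-flight datagram. */
    typedef struct _SEND_CONTEXT {
        PMDL  Mdl;
        PVOID Buffer;
    } SEND_CONTEXT, *PSEND_CONTEXT;

    static NTSTATUS
    SendComplete(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
    {
        PSEND_CONTEXT ctx = (PSEND_CONTEXT)Context;
        UNREFERENCED_PARAMETER(DeviceObject);

        /* Success here only means WSK/TCPIP accepted the datagram, not that
           the frame actually made it onto the wire. */
        IoFreeMdl(ctx->Mdl);
        ExFreePoolWithTag(ctx->Buffer, 'dnsK');
        ExFreePoolWithTag(ctx, 'dnsK');
        IoFreeIrp(Irp);
        return STATUS_MORE_PROCESSING_REQUIRED;    /* we own the IRP */
    }

    NTSTATUS
    SendOneDatagram(PWSK_SOCKET Socket, PSOCKADDR RemoteAddr,
                    const VOID *Payload, SIZE_T Length)
    {
        const WSK_PROVIDER_DATAGRAM_DISPATCH *dispatch =
            (const WSK_PROVIDER_DATAGRAM_DISPATCH *)Socket->Dispatch;
        PSEND_CONTEXT ctx;
        WSK_BUF wskBuf;
        PIRP irp;

        ctx = ExAllocatePoolWithTag(NonPagedPool, sizeof(*ctx), 'dnsK');
        if (ctx == NULL) return STATUS_INSUFFICIENT_RESOURCES;

        ctx->Buffer = ExAllocatePoolWithTag(NonPagedPool, Length, 'dnsK');
        if (ctx->Buffer == NULL) {
            ExFreePoolWithTag(ctx, 'dnsK');
            return STATUS_INSUFFICIENT_RESOURCES;
        }
        RtlCopyMemory(ctx->Buffer, Payload, Length);

        ctx->Mdl = IoAllocateMdl(ctx->Buffer, (ULONG)Length, FALSE, FALSE, NULL);
        if (ctx->Mdl == NULL) {
            ExFreePoolWithTag(ctx->Buffer, 'dnsK');
            ExFreePoolWithTag(ctx, 'dnsK');
            return STATUS_INSUFFICIENT_RESOURCES;
        }
        MmBuildMdlForNonPagedPool(ctx->Mdl);

        wskBuf.Mdl    = ctx->Mdl;
        wskBuf.Offset = 0;
        wskBuf.Length = Length;

        irp = IoAllocateIrp(1, FALSE);
        if (irp == NULL) {
            IoFreeMdl(ctx->Mdl);
            ExFreePoolWithTag(ctx->Buffer, 'dnsK');
            ExFreePoolWithTag(ctx, 'dnsK');
            return STATUS_INSUFFICIENT_RESOURCES;
        }
        IoSetCompletionRoutine(irp, SendComplete, ctx, TRUE, TRUE, TRUE);

        /* Hands the datagram to WSK; completion (and the final status) is async. */
        return dispatch->WskSendTo(Socket, &wskBuf, 0, RemoteAddr, 0, NULL, irp);
    }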

Following some earlier posts I'll do some tests with QoS settings and different packet rates and sizes. And I'd like to do some tests with SO_SNDBUF, but as mentioned in my first post I'm still getting STATUS_INVALID_PARAMETER whenever I try to get/set the buffer size (exactly as described in http://www.osronline.com/showthread.cfm?link=157209).

Regards,
Johannes

If a sendto is done, but the bits never appear on the outgoing wire, then the packets are probably being sent to the driver faster than it can send them out. Under the definitions of UDP, it is legitimate for the sender to drop the packets. It probably doesn't matter how deep the queue is, unless the data is bursty. And, as far as I know, there is no way to have UDP tell you that it's going to drop a packet. In an ideal world, you could use an async socket and the send callback would not be issued unless there is enough buffer space, but I don't know if we live in this ideal world. But a synchronous socket is free to discard a packet if you push too hard.

Of course, the problems of UDP fight with the advantages of the protocol,
most particularly broadcast and multicast. There is no good Happy Place
once you commit to it; only places of reduced pain. But you HAVE to
assume that packets will be lost (discarded by sender or receiver, most
commonly on buffer-unavailability), and you HAVE to develop strategies
that deal with this loss.

If you can withstand the loss, recover, and keep on going, that’s one of
the low-pain points. If you must recover the data, you have to have ways
to discover the loss and request retransmission. Ultimately, for 100%
reliability, you end up with a bad implementation of TCP/IP simulated in
UDP (I’ve seen it happen; each problem generated an even more complex
kludge to solve it, until the project toppled into oblivion). Don’t
forget that in addition to missing packets, you have to deal with
out-of-order reception and duplicate packets.

One client I had was using UDP multicast for video. What they did was
send a segment of a scan line, essentially an x,y pair plus the (24-bit
color) data bits. It was inherently insensitive to the occasional lost
packet, because they just used the last set of pixels. A one-frame “tear”
of a few scan lines was unnoticeable. And if there were duplicate packets
or out-of-order packets, the system was inherently insensitive to these.
They did work on a single LAN without going through a router. There were
lost packets, and they gave me a number for packet lossage. It was low,
but I have long since forgotten it. I think this was about 2004, plus or
minus a couple years. I had to develop a special multicast module (and
lab) for my networking course just for them.

Bottom line: you simply CANNOT build a system that assumes UDP gives any
guarantees. Period. It is not, like TCP/IP, as “reliable as a piece of
wire between the two machines”. Even if, under very restricted conditions
which are not guaranteed to be met all the time, it once conveyed the
illusion that it was.

I do not think there is any way to "throttle" UDP sending in a graceful fashion (that is, by having both a way to detect that you have reached the sender's internal capacity and a way to know when you can resume), unless async UDP withholds the send notification until local sender buffers are available. I've never worried about this, so I have no answers.
joe

Hi Joe,

Thanks for your answer - I think our posts just crossed. So you see, we are aware of lost and disordered packets on the receiving side. My goal is to improve the sending side on a PC to be as close as possible to the embedded devices. In other words: does this really have to be a difference of 10 per second compared to 10 per day? Or can I improve this at the level of Winsock Kernel without writing a dedicated NIC hardware driver? (… or using the IntervalZero RTOS platform)

Regards,
Johannes

Consider the following options:

If you are not using async sockets, do so. See if this gives any
improvement.

I am not aware of any mechanism on the sender side to say the sender is
about to discard a packet, but you need to explore this possibility. I
don’t hold out much hope for this approach.

Here’s a question for network experts: is sending handled by a
PASSIVE_LEVEL thread in the kernel? If so, what is its priority? If it
is from a PASSIVE_LEVEL thread, would there be any way to run this thread
at higher priority? It may just be that on a heavily-loaded system the
bottleneck is not how fast the bits get on the wire, but how fast packets
are dequeued for sending, and that may be influenced by thread priorities.

Embedded systems are easy: they have nothing else to do but service the requests, and in general you can prove that they will meet all necessary timing windows (absolutely, in the case of cyclic-exec models; general, schedulable RTOS systems can be modeled using Rate Monotonic Analysis or the newer AADLv2 modeling). But general-purpose operating systems, such as Windows, Linux, Mac OS X, etc., are not provable using these techniques, and in general do not treat shoving network packets out as fast as possible as one of their design criteria. So you are expecting Windows (or any other general-purpose OS) to meet requirements that were either never thought of as important, or were specifically relegated to being unimportant. This is ultimately a losing game.

There are workarounds: maintain an app-level queue and measure the rate
you are sending packets down. When you have sent “just enough”, stop
sending for some delta-T, then resume. You will want to make the
parameters “tunable” at runtime to deal with specific network drivers or
system loads, so plan on that.

If I were doing this, I would have one thread whose responsibility was
managing the network connection. It would receive packets from all other
threads and queue them. I would then have a dequeue thread that actually
sent the packets. It would use a semaphore to block sending after N
packets were sent. A timer would ReleaseSemaphore() for some value,
perhaps just 1, perhaps 4, 8, or 17. This throttling-by-timer is a bit
hokey, but cleaner than most other models, and it does limit what you send
down. Your tunable parameters, if all packets are the same size, is the
number of packets, the timing interval, and the amount to release the
semaphore. You adjust these until you stop losing packets on send. If
packets are varying sizes, then your value N for the semaphore is computed
based on packet size, or maybe you change the timer or the release count
based on some function of total bytes transmitted (that is, if you have
lots of short packets you may have a lot more buffer space than if you
have nothing but long packets, and with a mix, you have to account for the
kernel usage by a statistical model). For better response time, I might
use a MultiMedia timer (which executes its callback function in a separate
thread), which gives another tunable parameter, the precision of the
timer. For example, I can ask for 5ms resolution, or 0, which gives
maximum resolution at a cost of increased overall system overhead. If the
network stack can’t give you feedback on UDP lossage at transmission time,
you need something like this to simulate a decent throttling algorithm.
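A minimal user-mode sketch of that scheme (all names, addresses, and the specific tunables here are made-up placeholders, and most error checking is omitted) might look like this:

    #include <winsock2.h>
    #include <ws2tcpip.h>
    #include <windows.h>
    #include <mmsystem.h>
    #pragma comment(lib, "ws2_32.lib")
    #pragma comment(lib, "winmm.lib")

    #define PERIOD_MS       5     /* timer interval              (tunable) */
    #define RELEASE_COUNT   4     /* packets admitted per tick   (tunable) */
    #define MAX_OUTSTANDING 64    /* initial send quota          (tunable) */

    static HANDLE g_sendSem;

    /* Multimedia timer callback: runs in its own thread and tops up the quota. */
    static void CALLBACK TimerTick(UINT id, UINT msg, DWORD_PTR user,
                                   DWORD_PTR dw1, DWORD_PTR dw2)
    {
        ReleaseSemaphore(g_sendSem, RELEASE_COUNT, NULL);
    }

    /* Dequeue thread: consumes one quota unit per datagram, then sends. */
    static DWORD WINAPI DequeueThread(LPVOID param)
    {
        SOCKET s = (SOCKET)(ULONG_PTR)param;
        struct sockaddr_in dest = {0};
        char payload[512] = {0};

        dest.sin_family = AF_INET;
        dest.sin_port   = htons(5000);                        /* placeholder */
        inet_pton(AF_INET, "192.168.0.10", &dest.sin_addr);   /* placeholder */

        for (;;) {
            /* In the real design this would pop the next packet from the
               app-level queue fed by the other threads. */
            WaitForSingleObject(g_sendSem, INFINITE);         /* throttle point */
            sendto(s, payload, sizeof(payload), 0,
                   (struct sockaddr *)&dest, sizeof(dest));
        }
        return 0;
    }

    int main(void)
    {
        WSADATA wsa;
        SOCKET s;
        MMRESULT timer;

        WSAStartup(MAKEWORD(2, 2), &wsa);
        s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
        g_sendSem = CreateSemaphore(NULL, MAX_OUTSTANDING, 0x7FFFFFFF, NULL);

        timeBeginPeriod(1);                /* ask for 1 ms timer precision */
        timer = timeSetEvent(PERIOD_MS, 0, TimerTick, 0, TIME_PERIODIC);

        CreateThread(NULL, 0, DequeueThread, (LPVOID)(ULONG_PTR)s, 0, NULL);
        Sleep(60 * 1000);                  /* let it run for a minute */

        timeKillEvent(timer);
        timeEndPeriod(1);
        closesocket(s);
        WSACleanup();
        return 0;
    }

The three #defines are exactly the tunables described above; with varying packet sizes you would make RELEASE_COUNT (or the timer period) a function of bytes rather than packets.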

Note that you can create a “test load” and run a “set parameters” option
in your program that sends sequence-numbered, and perhaps time-stamped,
test messages to a test client. It starts doing this with a low
transmission rate based on the various throttling parameters, then
increases the rate until the test client starts reporting packet loss.
Thus, the tunable parameters can be self-adjusting for a given network,
machine load, NIC card, and possibly even taking into account the phase of
the moon. This minimizes the need for the end user to twiddle with these
complex parameters.
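A rough sketch of that calibration pass, assuming some out-of-band channel through which the test client reports its loss count (QueryClientLossCount here is hypothetical):

    #include <winsock2.h>
    #include <windows.h>
    #include <stdint.h>
    #pragma comment(lib, "ws2_32.lib")

    typedef struct {
        uint32_t sequence;      /* lets the test client detect gaps and reordering */
        uint64_t qpcTimestamp;  /* lets it measure jitter */
    } TEST_PACKET;

    /* Hypothetical: however the test client reports back its observed loss. */
    extern uint32_t QueryClientLossCount(void);

    /* Ramp the rate up until the client starts reporting loss; return the last
       rate that was loss-free. The ramp step and trial length are tunables. */
    uint32_t FindMaxLossFreeRate(SOCKET s, const struct sockaddr_in *dest)
    {
        uint32_t rate = 1000;          /* packets per second, start low */
        uint32_t lastGood = 0;
        uint32_t sequence = 0;

        for (;;) {
            uint32_t lossBefore = QueryClientLossCount();
            LARGE_INTEGER qpc;
            TEST_PACKET pkt;

            /* One second's worth of traffic at the trial rate. In a real test
               these would go through the throttled sender above rather than
               being blasted out in a burst. */
            for (uint32_t i = 0; i < rate; i++) {
                QueryPerformanceCounter(&qpc);
                pkt.sequence = sequence++;
                pkt.qpcTimestamp = (uint64_t)qpc.QuadPart;
                sendto(s, (const char *)&pkt, sizeof(pkt), 0,
                       (const struct sockaddr *)dest, sizeof(*dest));
            }
            Sleep(1000);               /* give the client time to report */

            if (QueryClientLossCount() != lossBefore)
                return lastGood;       /* loss started at this rate */

            lastGood = rate;
            rate += rate / 4;          /* ramp up by 25% and try again */
        }
    }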
joe

We just crossed again. I saw your latest post after I sent mine, but it
inspired some ideas, which I then sent, and you probably didn’t see before
you sent this.

Ain’t email wonderful?
joe

Hi Joe,

At the moment I'm passing a block of, let's say, 240 UDP packets from my user-mode application to my driver. Each packet has its own timestamp (in performance-counter units; I've checked that HPET is available). The driver has a worker thread running at passive level, and I set its priority to HIGH_PRIORITY. The worker thread sends the UDP packets according to their timestamps and waits until the next packet should be sent. This wait is done by KeWaitForSingleObject(), and during this waiting I'm raising the thread to DPC level to achieve better timing. On the receiver side I'm seeing a jitter of 0.8 milliseconds, which would be OK. So I'm just trying to improve the packet loss.
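Roughly, the pacing loop in the worker thread looks like the sketch below (simplified, kept at passive level, and with the queue/send helpers reduced to placeholders rather than my actual driver code):

    #include <ntddk.h>

    typedef struct _PACED_PACKET {
        LIST_ENTRY    Link;
        LARGE_INTEGER DueQpc;          /* absolute due time, QPC units */
        /* payload description omitted */
    } PACED_PACKET, *PPACED_PACKET;

    /* Assumed to exist elsewhere in the driver (placeholders). */
    extern KEVENT        g_WakeEvent;
    extern PPACED_PACKET DequeueNextPacket(VOID);
    extern NTSTATUS      SendOneDatagram(PPACED_PACKET Packet);

    VOID PacingThread(PVOID Context)
    {
        LARGE_INTEGER qpcFreq, now;

        UNREFERENCED_PARAMETER(Context);
        KeSetPriorityThread(KeGetCurrentThread(), HIGH_PRIORITY);

        for (;;) {
            PPACED_PACKET pkt = DequeueNextPacket();
            if (pkt == NULL)
                break;

            now = KeQueryPerformanceCounter(&qpcFreq);
            if (pkt->DueQpc.QuadPart > now.QuadPart) {
                /* Convert the remaining QPC ticks to a relative 100 ns timeout
                   (a negative value means a relative wait). Waiting with a
                   non-zero timeout requires IRQL <= APC_LEVEL. */
                LARGE_INTEGER wait;
                wait.QuadPart =
                    -((pkt->DueQpc.QuadPart - now.QuadPart) * 10000000LL)
                    / qpcFreq.QuadPart;
                KeWaitForSingleObject(&g_WakeEvent, Executive, KernelMode,
                                      FALSE, &wait);
            }
            SendOneDatagram(pkt);      /* hands the packet to WskSendTo() */
        }
        PsTerminateSystemThread(STATUS_SUCCESS);
    }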

Regards,
Johannes

OK, some news: I took a closer look at the statistics of my Cisco switch (SG 300-20) and found an entry named "Packets with Errors" in the statistics of the port where I connected my sending PC. This number seemed to increase in step with the lost packets on the receiving PC. A further statistics page called "Etherlike" shows exactly the same number as "Internal MAC Receive Errors", so I thought it could be a switch issue and decided to do some tests using an unmanaged switch (Netgear GS108T) … and for now, after nearly 30 minutes, there's not a single lost packet on the receiving side. (So I'll have to ask the embedded-device people whether they are really using the same switch and firmware.)

>This wait is done by KeWaitForSingleObject() and during this waiting I’m raising the thread to DPC level

You CANNOT call KWFSO while at DISPATCH_LEVEL. That’s a big violation of rules.

> You CANNOT call KWFSO while at DISPATCH_LEVEL. That’s a big violation of rules.

I’ve looked at:

http://msdn.microsoft.com/en-us/library/windows/hardware/ff553350(v=vs.85).aspx

which says:

Callers of KeWaitForSingleObject must be running at IRQL <= DISPATCH_LEVEL

So I was not aware of violating rules? … and it seems to work?

>Callers of KeWaitForSingleObject must be running at IRQL <= DISPATCH_LEVEL

Then it says:

However, if Timeout = NULL or *Timeout != 0, the caller must be running at IRQL <= APC_LEVEL and in a nonarbitrary thread context. (If Timeout != NULL and *Timeout = 0, the caller must be running at IRQL <= DISPATCH_LEVEL.)

This essentially means that if you’re on DISPATCH_LEVEL, you can only use zero timeout to check for the event. With zero timeout, the function never waits.

Read it more carefully. You can call KeWaitForSingleObject at DISPATCH_LEVEL, but only if you supply a timeout and that timeout is of zero length.

What this means is that you cannot actually wait at DISPATCH_LEVEL. If you’re using a dispatcher object as a lock, you can attempt to acquire that lock at DISPATCH_LEVEL, but if it is owned by some other thread, then you must deal with the fact that you can’t wait for it to be released. So calling KeWaitForSingleObject in this way (with a zero-length timeout) can be used as an attempt to acquire a lock that may fail. If it does fail, you have to deal with that somehow in your code, probably by posting something to a worker thread, or one of the system work item queues.
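In code, the only legal pattern at DISPATCH_LEVEL is therefore the non-blocking "try" (a rough sketch; the deferral on failure is left to the caller):

    #include <ntddk.h>

    /* Attempt to acquire a dispatcher object (here a semaphore used as a lock)
       without blocking. Legal at DISPATCH_LEVEL because the timeout is non-NULL
       and zero. Returns TRUE if the lock was acquired. */
    BOOLEAN TryAcquireAtDispatch(PKSEMAPHORE Lock)
    {
        LARGE_INTEGER zero;
        NTSTATUS status;

        zero.QuadPart = 0;   /* Timeout != NULL and *Timeout == 0 */

        status = KeWaitForSingleObject(Lock, Executive, KernelMode,
                                       FALSE, &zero);
        if (status == STATUS_SUCCESS)
            return TRUE;     /* got it without waiting */

        /* STATUS_TIMEOUT: the object is owned/not signaled. We may not block
           here, so the caller has to defer the work, e.g. to a PASSIVE_LEVEL
           worker thread or a system work item. */
        return FALSE;
    }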

If calling it with no timeout has "seemed to work", then you've just been lucky and there has been no contention for your dispatcher object.

  • Jake Oshins
    Windows Kernel Team

The performance of sending 240 UDP packets to a driver via a custom IOCTL shouldn't be much different from calling WSASendTo with 240 buffers, unless you have an LSP or some other UM interference (AV software etc.). Especially if you use IOCP and SetFileIoOverlappedRange, the data rates you are talking about should not present a huge problem on good hardware, unless you have lots of other stuff going on.

This is totally irrelevant to your question about packet loss, but a general comment on your approach: depending on your application, maintaining a driver may not be necessary, and if you can remove it, the complexity of your code will decrease dramatically.
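As a rough illustration of that user-mode path (ports, addresses, and the packet count are placeholders; SetFileIoOverlappedRange and most error handling are omitted), the overlapped sends plus the IOCP drain might look like this:

    #include <winsock2.h>
    #include <ws2tcpip.h>
    #include <windows.h>
    #pragma comment(lib, "ws2_32.lib")

    #define NUM_PACKETS 240
    #define PACKET_SIZE 512

    int main(void)
    {
        WSADATA wsa;
        SOCKET s;
        HANDLE iocp;
        struct sockaddr_in dest = {0};
        static char payload[NUM_PACKETS][PACKET_SIZE];
        static OVERLAPPED ov[NUM_PACKETS];         /* zero-initialized */
        WSABUF buf;
        DWORD posted = 0, completed = 0;

        WSAStartup(MAKEWORD(2, 2), &wsa);
        s = WSASocket(AF_INET, SOCK_DGRAM, IPPROTO_UDP, NULL, 0,
                      WSA_FLAG_OVERLAPPED);

        dest.sin_family = AF_INET;
        dest.sin_port   = htons(5000);                        /* placeholder */
        inet_pton(AF_INET, "192.168.0.10", &dest.sin_addr);   /* placeholder */

        /* One completion port collects all send completions. */
        iocp = CreateIoCompletionPort((HANDLE)s, NULL, 0, 0);

        for (DWORD i = 0; i < NUM_PACKETS; i++) {
            buf.buf = payload[i];
            buf.len = PACKET_SIZE;
            int rc = WSASendTo(s, &buf, 1, NULL, 0,
                               (struct sockaddr *)&dest, sizeof(dest),
                               (LPWSAOVERLAPPED)&ov[i], NULL);
            if (rc == 0 || WSAGetLastError() == WSA_IO_PENDING)
                posted++;           /* otherwise the local stack rejected it */
        }

        /* Reap the completions so the OVERLAPPED structures could be reused. */
        while (completed < posted) {
            DWORD bytes; ULONG_PTR key; LPOVERLAPPED pov;
            BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &pov,
                                                INFINITE);
            if (!ok && pov == NULL)
                break;              /* GetQueuedCompletionStatus itself failed */
            completed++;            /* success, or a send that completed in error */
        }

        CloseHandle(iocp);
        closesocket(s);
        WSACleanup();
        return 0;
    }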

> Callers of KeWaitForSingleObject must be running at IRQL <= DISPATCH_LEVEL

DISPATCH_LEVEL is just the Windows notion of a "cannot block" context. Therefore, trying to wait at DISPATCH_LEVEL simply turns the entire logic the Windows kernel is based upon upside down.
The most that you are allowed to do with KeWaitForSingleObject() at DISPATCH_LEVEL is to check whether you can lock a semaphore straight away. However, you cannot block if your target is currently unavailable. This is what the 'timeout' parameter is for…

> So I was not aware of violating rules?

Well, you seem to be unaware of the basic kernel-level concepts as well…

> … and it seems to work?

Simply because you were lucky enough, up to this point…

Anton Bassov

  1. Have you tried unchecking the packet scheduler on the LAN connection?

  2. Is the loss rate the same with datagrams shorter than 570-something bytes?

– pa

Yes, it's quite clear; I didn't read the KeWaitForSingleObject() documentation carefully enough. Thanks for your tips. But now I'm almost surprised that I've never seen an IRQL_NOT_LESS_OR_EQUAL, which I'd have assumed to be the result of such a mistake. Or did the kernel just replace my timeout parameter with a value valid at DISPATCH_LEVEL?

What would be the recommended way to have a worker thread wait for 20 microseconds, to achieve low jitter when sending the packets? I think KeStallExecutionProcessor() is meant for single-shot stalls rather than for use in a loop? At the moment I'm seeing a jitter of 0.8 milliseconds, which is OK, but if there's a chance to improve the jitter, I'd be very interested in how it could be done.

Regards,
Johannes