High rate Ethernet packet driver suggestion

Hello,
I’m not a Windows driver developer, so I’m here to ask for suggestions.
I’ve developed an FPGA-based system. It exchanges information with a PC over a Gigabit NIC.
It can run in a couple of ways: raw Ethernet frames or UDP packets.

The FPGA generates a packet and sends it to the PC. The PC has to read the packet, do some math, and reply to it as fast as possible. Each packet payload carries FPGA-to-PC data and vice versa. Each packet header also contains a unique ID and some useful stats, such as the “flyback” time, i.e. the time the PC takes to reply to the packet with ID N.

The packet payload is about 350 bytes.
The packet rate should be at least 2 kHz; 10 kHz would be nice to have.

We currently use client software written for Windows that talks to a National Instruments PCIe board through NI drivers. We are now building new electronics.
The client software is written for Windows, so I cannot simply move to Linux with an RT kernel.

Npcap is simply not fast enough: at 1 kHz the flyback time is not stable, swinging from 100 µs to more than 1 ms.

For this reason I’m thinking about writing a kernel driver.
This driver should:
1 - receive the packet
2 - decode the incoming packet
3 - move the data to user space (copying it into the GUI client’s memory?)
4 - read back the computed packet (from memory?)
5 - encode the reply packet
6 - send the packet back
As a first stage, steps 3 and 4 can be implemented inside the driver as a simple loopback.

I’m looking at the NDIS protocol driver sample and at the WSK echo sample.
Since I’m really new to this kind of development, I’d like to ask for suggestions before starting to experiment.

Thanks!

Anytime someone starts talking about timing in terms of frequency, I know there is going to be a problem.

For better or worse, the NT kernel’s design ethos is about throughput, not consistent latency.

The scheduler normally operates with a granularity of 10-15 ms. The multimedia timer APIs can reduce that to around 1 ms, but that provides a theoretical maximum of ~1 kHz, and that’s under ideal conditions.
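For reference, requesting the 1 ms timer resolution is a single call pair; a minimal sketch, with nothing specific to any particular project:

```c
// Minimal sketch: ask for 1 ms scheduler/timer resolution via the multimedia
// timer API for the lifetime of the time-critical section. Link with winmm.lib.
#include <windows.h>
#include <timeapi.h>
#include <stdio.h>
#pragma comment(lib, "winmm.lib")

int main(void)
{
    // This is a hint, not a guarantee: the OS grants the smallest period any
    // process has requested system-wide.
    if (timeBeginPeriod(1) != TIMERR_NOERROR) {
        fprintf(stderr, "timeBeginPeriod(1) refused\n");
        return 1;
    }

    /* ... time-sensitive work here ... */

    timeEndPeriod(1);   // always pair with timeBeginPeriod
    return 0;
}
```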

Conventional wisdom is to change your program so that it does not need precision like this or, if you can’t, to use some kind of FPGA. Writing a KM driver to process UDP / Ethernet frames will not substantially alter the timing versus a well-written UM program.

Thanks for the reply.

Our current software runs in UM and achieves a round-trip time of roughly 100-200 µs, but it uses the NI device drivers for an NI PCIe board. So we were thinking of replacing only the module of our software that talks to the hardware, targeting the new UDP-based hardware.
Note that for our purposes we have to process every single packet; we cannot buffer packets and send them back later, hence the frequency / latency requirement.

@lkdg said:
I’m not a Windows driver developer, so I’m here to ask for suggestions.
I’ve developed an FPGA-based system. It exchanges information with a PC over a Gigabit NIC.
It can run in a couple of ways: raw Ethernet frames or UDP packets.

A few things to keep in mind. First, if your protocol is not IP then you must use a kernel driver. Windows does not have a raw socket type that would allow you to deal with Ethernet frames in user mode. If you are using UDP then you have the choice of packets in kernel mode or UDP datagrams in user mode. The user-mode UDP and TCP stacks are pretty optimized, and even with a kernel driver you will have to put in some effort to match their potential. Try using something like iperf3 to benchmark your environment with TCP and UDP. With iperf and TCP I can hit 9.5 Gb/s on cheap, five-year-old 10GbE hardware.

Second, maximizing performance is going to be as much (or more) about the processing strategy as about the underlying protocol. Any sort of queuing and blocking is going to kill your ability to achieve the highest levels of throughput. As mentioned, the Windows scheduler is too granular for that, so you’d need to employ a mechanism that does spin waits or similar. Is it possible for the packet to be processed entirely in kernel mode? If you really want to see what potential you have, I’d start there. Take something like the NDIS LWF sample and see what kind of rates you get in your environment just doing an immediate packet send in response to each packet received.
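As an aside, if you want to see what the user-mode “spin wait” approach looks like (this is separate from the LWF kernel test), here is a minimal busy-poll receive/echo sketch on a non-blocking UDP socket; the port number and buffer size are placeholders, and it assumes you can dedicate a core to the loop:

```c
// Minimal user-mode sketch of a spin-wait (busy-poll) receive/echo loop on a
// non-blocking UDP socket. Port and buffer size are placeholders.
#include <winsock2.h>
#include <ws2tcpip.h>
#pragma comment(lib, "ws2_32.lib")

int main(void)
{
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);

    SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    struct sockaddr_in local = { 0 };
    local.sin_family      = AF_INET;
    local.sin_port        = htons(5000);          /* placeholder port */
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, (struct sockaddr *)&local, sizeof(local));

    u_long nonblocking = 1;
    ioctlsocket(s, FIONBIO, &nonblocking);        /* never block in recvfrom */

    char buf[2048];
    struct sockaddr_in peer;
    int peerLen;

    for (;;) {
        peerLen = sizeof(peer);
        int n = recvfrom(s, buf, sizeof(buf), 0,
                         (struct sockaddr *)&peer, &peerLen);
        if (n > 0) {
            /* the "math" goes here; then bounce the packet straight back */
            sendto(s, buf, n, 0, (struct sockaddr *)&peer, peerLen);
        }
        /* n == SOCKET_ERROR with WSAEWOULDBLOCK means nothing has arrived
           yet; just keep spinning. */
    }
}
```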

Thanks.

A few things to keep in mind. First, if your protocol is not IP then you must use a kernel driver.

For the raw Ethernet tests I’m using Npcap. UDP gives similar results in terms of timing.
I can’t use iperf against the FPGA implementation, but I’ve tested iperf between a Windows PC and a Linux PC, using the same Windows NIC I’m using for the UDP version of the FPGA-to-PC communication. The results are good: 1.10 GBytes transferred in 10 s at 947 Mbits/sec, with a jitter of 0.084 ms and 0 lost packets out of 144522 sent.

Unfortunately the packets cannot be processed in the kernel, because the core of our current software is precisely how the packets are processed. But I can try a test driver just to check the round-trip time of a “loopback” packet.
I’ll take a look at the NDIS LWF sample.

I would like to ask a couple more things:

  1. On the FPGA side I have a Gigabit PHY. Do you think a 10GbE NIC on the Windows side, such as an X540-T2-based card, would help?
  2. Do you think Windows IoT will help performance? I’m going to try it in the next few days.
  3. kithara.com claims “real-time for Windows” and “real-time communication” performance for raw Ethernet and UDP too. Have you ever heard of this company?

Thanks again for any help.

(1) Seems like an odd question. A 10GbE phy will get 10X the throughput, assuming you have a 10Gb network infrastructure. If your routers and switches are all gigabit, then it isn’t going to change a thing.

(2) Another odd question. Windows IoT uses exactly the same kernel as desktop Windows. It just has some features and subsystems disabled.

(3) Kithara is almost a hypervisor; it essentially runs Windows as one of its clients, the way Windows 3.x used to work with VxDs. I’m sure it works, but it’s not an option if you’re selling devices, because all of your customers would have to install Kithara’s environment as their operating system.


First, Windows does support non-IP protocols. They are rare, and not very useful given that they are generally not routable beyond a single network segment or LAN.

Npcap is a library designed for packet capture. I don’t know the details of the implementation, but for that problem the latency of each packet is irrelevant: the optimization goals are to minimize the impact on the system as a whole and to maximize throughput so that captured packets aren’t lost.

You say that you have tried using UDP. How have you tried to use UDP? You may not think so, but there can be an enormous difference in results depending on the threading / IO model in use. To achieve good results, you want to use an IOCP or the thread pool APIs that are built on this model. For reference, I worked on a project with Windows XP / Server 2003 where I was able to saturate a 1 Gb/s link with UDP traffic for a stateless protocol (DNS queries). In those days it was an i440BX chipset with dual 1 GHz CPUs. Your phone probably has more computational power than that.

But the problem is not saturating a 1 Gb/s link. Your problem is latency, and jitter in that latency. Multimedia timers and thread priority can help you if 99% is good enough for your application. If you need 100%, you are sunk on Windows.
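For completeness, the thread-priority part is just a couple of calls; a minimal sketch (it improves the common case, it does not make Windows deterministic):

```c
// Minimal sketch: raise the priority of the packet-handling thread. This helps
// the common case but does not remove worst-case scheduling jitter.
#include <windows.h>

void RaisePacketThreadPriority(void)
{
    // HIGH_PRIORITY_CLASS is a reasonable choice; REALTIME_PRIORITY_CLASS
    // needs the "increase scheduling priority" privilege and is easy to abuse.
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
}
```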

Thank you Tim_Roberts and MBond2

@Tim_Roberts

  1. I know 10 is ten times 1 :slight_smile: But I’m wondering whether the driver or the NIC internals of a 10GbE card perform better in terms of latency than those of a 1GbE card.
  2. I’m going to try IoT; maybe the disabled features make the system more responsive.
  3. It could be an option; I have to investigate further by emailing Kithara.

@MBond2
We have been using Npcap for capturing until now, as it’s designed for that, and for that specific task we don’t mind latency, so it works well. But on sending there is some occasional delay; indeed, I opened a question on the Npcap forum some time ago.
To be honest, I tried UDP a while ago with the previous FPGA version, and it was just a test not embedded in our main software. We have not spent much time on UDP, due to the uncertainty about transmission order. I’ll try it now, building a specific test. My colleague developed the raw communication inside our current software through Npcap, but as you tell me, UDP may be more responsive than raw Ethernet on Windows.

99% is good enough for your application. If you need 100% you are sunk on Windows
I was hoping not to hear this, but it was a fear I had. For this first project stage we won’t move the “core part” onto an RTOS or anything like that; we’ll keep using our current software.
So, what’s good enough for us now?
What I would like to achieve is to move 350 bytes from the FPGA to the PC and back at 2 kHz. With a round-trip time of 200 µs for a loopback packet, that leaves 300 µs of each 500 µs period for the math on the packet, without jitter and with consistent behaviour.

I’ll keep you updated, thanks.

If you need to do that 99% of the time, then no problem. If you need 100%, then you have a big problem.

What I would like to achieve is to move 350 bytes from the FPGA to the PC and back at 2 kHz. With a round-trip time of 200 µs for a loopback packet, that leaves 300 µs of each 500 µs period for the math on the packet, without jitter and with consistent behaviour.

Hmmm… I guess I don’t understand why this would be hard. Admittedly, I’m no NDIS guy (I haven’t touched an NDIS driver for many years). But, using a conventional WDF driver, I can definitely get about a million (Direct I/O) IOPs on a fast 16-core processor. How could 2K IOPs be a problem?

Sorry if that’s not helpful, but I didn’t want you to be saturated with doom and gloom without my trying to provide at least SOME context.

I’ve run some tests, using both the UDP protocol and the raw Ethernet protocol loaded on the FPGA.
I’ve also tested Windows 10 IoT LTSC.
It seems there’s not a big difference in stability between Windows 10 and the IoT LTSC version (installed following the sysconf16 guide); IoT is a bit more “stable” in terms of results.
I’ve monitored the average RTT and any delayed packets.
Raw Ethernet (through Npcap) seems to perform a little better than UDP.
I’ve encountered some delayed packets, as @MBond2 expected. Let’s say that at 2 kHz my average RTT is 100 µs on UDP and 60 µs on raw Ethernet, and every now and then I get delayed packets within a 5 second window, sometimes 0, sometimes up to 10.
On Windows I’m at 99% on time.
Same tests on Linux (no PREEMPT): if a UI is running, the results are similar to Windows, maybe a little more stable but almost the same. Things change without a UI: on a Debian console-only system there are no delayed packets and the average RTT is 70 µs.
I’ll keep trying on Windows, maybe on a more powerful machine, in production.
Current hardware is a desktop: i7-11700, 32 GB RAM, SSD.

I think that if we are to help you further, you need to explain HOW you send the packets in more detail.

Thank you @MBond2.
I’m going to test this on hardware within a few weeks, so maybe 99% accuracy will be enough for this first stage.
Anyway, here is the sample code for UDP testing: https://pastebin.com/yWRCyZXB

This is broadcast traffic? That will dramatically affect the way that it is handled in hardware and software. Does it really need to be broadcast traffic?

Having said that, I can tell you what limits the performance of this code. You have used synchronous blocking IO. What you want to do is use OVERLAPPED IO and make several concurrent WSARecv or WSARecvFrom calls. You expect these calls to fail and GetLastError to return ERROR_IO_PENDING (there is a case where the call does succeed). Then, some time later, when an inbound packet arrives, one of the pending calls completes and the newly received UDP data becomes available. Use an IOCP or the thread pool APIs to detect and process the completion, do the work required with that packet data (calculate and send a reply), and make a new pending call.

There are still causes of latency / jitter that you can’t eliminate, but one major source that you can eliminate this way is when your ‘do work with the packet data’ step takes ‘too long’ and the UDP stack does not have a UM buffer ready to fill when the network traffic arrives. This may seem to be an implausible scenario, but remember we are working with fractions of 1% of the calls, and thread preemption (along with other factors) can cause long delays.

This technique, along with multiple threads calling GetQueuedCompletionStatus, will dramatically improve your throughput (not your issue), but it will also reduce both your average and peak latency.
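To make that concrete, here is a minimal sketch of the pattern just described; the port number, buffer size, and number of outstanding receives are placeholders, not recommendations:

```c
// Minimal sketch of the overlapped / IOCP receive pattern: post several
// WSARecvFrom calls up front, complete them with GetQueuedCompletionStatus,
// reply, and immediately re-post. Port, sizes, and context layout are
// illustrative only.
#include <winsock2.h>
#include <ws2tcpip.h>
#include <stdio.h>
#pragma comment(lib, "ws2_32.lib")

#define PENDING_RECVS 8        /* how many receives to keep outstanding */
#define BUF_SIZE      2048

typedef struct PER_IO {        /* one per outstanding receive */
    WSAOVERLAPPED      ov;     /* must be first so we can cast back */
    WSABUF             wsaBuf;
    char               buf[BUF_SIZE];
    struct sockaddr_in from;
    int                fromLen;
    DWORD              flags;
} PER_IO;

static void PostRecv(SOCKET s, PER_IO *io)
{
    ZeroMemory(&io->ov, sizeof(io->ov));
    io->wsaBuf.buf = io->buf;
    io->wsaBuf.len = BUF_SIZE;
    io->fromLen    = sizeof(io->from);
    io->flags      = 0;

    int rc = WSARecvFrom(s, &io->wsaBuf, 1, NULL, &io->flags,
                         (struct sockaddr *)&io->from, &io->fromLen,
                         &io->ov, NULL);
    /* SOCKET_ERROR + WSA_IO_PENDING is the expected outcome here */
    if (rc == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING)
        fprintf(stderr, "WSARecvFrom failed: %d\n", WSAGetLastError());
}

int main(void)
{
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);

    SOCKET s = WSASocket(AF_INET, SOCK_DGRAM, IPPROTO_UDP,
                         NULL, 0, WSA_FLAG_OVERLAPPED);

    struct sockaddr_in local = { 0 };
    local.sin_family      = AF_INET;
    local.sin_port        = htons(5000);          /* placeholder port */
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, (struct sockaddr *)&local, sizeof(local));

    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    CreateIoCompletionPort((HANDLE)s, iocp, 0, 0); /* associate the socket */

    static PER_IO ios[PENDING_RECVS];
    for (int i = 0; i < PENDING_RECVS; i++)
        PostRecv(s, &ios[i]);

    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED *pov = NULL;
        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &pov, INFINITE);
        if (pov == NULL)
            continue;                              /* port-level failure */

        PER_IO *io = (PER_IO *)pov;                /* ov is the first member */
        if (!ok) {                                 /* this receive failed */
            PostRecv(s, io);
            continue;
        }

        /* do the math on io->buf[0..bytes), then reply to the sender */
        sendto(s, io->buf, (int)bytes, 0,
               (struct sockaddr *)&io->from, io->fromLen);

        PostRecv(s, io);                           /* keep a receive pending */
    }
}
```

Several threads can call GetQueuedCompletionStatus on the same port; for a strictly ordered 2 kHz stream, one completion thread pinned to a core may actually be simpler to reason about.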

First of all, thank you @MBond2; reading someone else’s code is not something people always do to help.

I can convert the UDP traffic from broadcast to a single destination IP, no problem. It’s broadcast now because we have another process (which will eventually run on another PC) that records packets. That process does not have a jitter problem because it uses a buffer and does not send packets back. Anyway, we can omit this function for now and send the packets to a single destination IP.

The reason why I’m not using async calls is that each packet needs to be processed in order.
For example, the FPGA sends packet ID 1000 and expects packet ID 1000 back before it sends the next packet (ID 1001); a packet is sent every 500 µs (2 kHz). I don’t know if async will work better.

Yes, we know the “do work with the packet data” step can only take a little time.
Another point is that at present, with our NI board, we don’t know whether any packets are delayed, because it does not implement the “flyback” time and the other counters in hardware. So we have to check it with external instruments (an oscilloscope and so on…), but it’s something we have to do.
In other words… maybe we are already in the 99% jitter condition without knowing it, and we are happy with that for the purposes of the current software.

We have so much work to do.
I think we are going to test the system on a real test machine in the next few weeks and measure performance, to check whether it is similar to our current NI board and driver. Remember that the software will be the same, written in LabWindows/CVI.

Again, thanks!

I have a bit of experience to share.

We had a telemetry system that sent packets from an FPGA via UDP over a GigE fiber optic line. The telemetry stream was continuous, 100 megabytes per second. The packets were nominally 8232 bytes each sent at 12.2 kHz, but because of the challenges of managing so many packets that fast, the FPGA munged the headers so that 7 samples were combined into a single jumbo packet, which looked like 57k bytes sent 1744 times a second.

This worked surprisingly well. We started out on Windows but shifted to Linux later. It was an expensive system, so we never shipped more than a couple dozen over a period of about 15 years. We did learn that the hardware (both host and NIC) is absolutely critical. Some inexpensive servers have I/O bus designs that simply cannot handle that many interrupts. On those systems, if we tried to do too much graphics work or tried to do USB disk copies, we would get dropped packets.

We also learned that not all fiber-capable NICs can handle jumbo packets. That was an unpleasant surprise.

We also learned that some high-end fiber NICs are extremely delicate. That works in a server room, but not in the field. This was an intrusion detection system, so the servers were set up in the field and had to be moved for demos. One manufacturer’s NIC (who I will not name) could not recover if the fiber was interrupted. We are accustomed to copper NICs, where you can unplug and replug as if nothing happened. With those NICs we tried, a disconnect would often require multiple reboots to recover, and the servers we were using took 5 minutes to reboot. It was painful during a demo.

Technically, UDP packet delivery is not guaranteed. In practice, we would usually get no more than two dropped packets a day, and often zero.

If you are happy, then we are happy!

But you do not have to use blocking IO to ensure in-order responses. If the packet data itself includes an ID or sequence, you can use that to ensure responses are sent in the right order (and decide what to do with ‘old’, out-of-sequence packets). If the packet data does not include an ID or sequence, you can create one yourself by using a counter and extending the OVERLAPPED struct to record the sequence that goes with each completion. Remember to hold some kind of lock while calling WSARecv so that the order in which you allocate sequence numbers is the same as the order in which the driver queues and completes the IRPs.
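A minimal sketch of the extended-OVERLAPPED idea, meant to slot into an IOCP receive loop like the one sketched earlier; the struct layout, names, and the critical section are illustrative, and InitializeCriticalSection must be called once at startup:

```c
// Minimal sketch: extend the OVERLAPPED with a sequence number so that
// completions can be replied to in order. Field and function names are
// illustrative, not from any real project.
#include <winsock2.h>
#include <windows.h>
#pragma comment(lib, "ws2_32.lib")

typedef struct SEQ_IO {
    WSAOVERLAPPED ov;          /* first member: cast back from LPOVERLAPPED */
    ULONGLONG     seq;         /* order in which this receive was posted    */
    WSABUF        wsaBuf;
    char          buf[2048];
    DWORD         flags;
} SEQ_IO;

static CRITICAL_SECTION g_postLock;   /* InitializeCriticalSection() once */
static ULONGLONG        g_nextSeq;

/* Post a receive while holding the lock, so sequence-number order matches
   the order in which the driver queues (and will complete) the IRPs. */
static void PostSequencedRecv(SOCKET s, SEQ_IO *io)
{
    EnterCriticalSection(&g_postLock);

    io->seq = g_nextSeq++;
    ZeroMemory(&io->ov, sizeof(io->ov));
    io->wsaBuf.buf = io->buf;
    io->wsaBuf.len = sizeof(io->buf);
    io->flags = 0;

    /* SOCKET_ERROR + WSA_IO_PENDING is the expected result on a bound,
       overlapped UDP socket. */
    WSARecv(s, &io->wsaBuf, 1, NULL, &io->flags, &io->ov, NULL);

    LeaveCriticalSection(&g_postLock);
}
```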

@Tim_Roberts said:
I have a bit of experience to share…

Thank you so much for sharing your experience here. Your notes on NIC hardware and UDP delivery are important to me. The requirements for the project I’m working on are a little lower than yours: let’s say 320 bytes of payload, excluding the header (which depends on whether the protocol is raw Ethernet or UDP), at 2 kHz back and forth, so no need for jumbo packets. Anyway, your feedback is something I’ll take into account.

@MBond2 said:
But you do not have to use blocking IO to ensure in order responses…

Thank you again. I know I should take the ID into account too; it’s something I’m already doing in a logger that we have implemented. In that case it’s about 800 bytes at 10 kHz, and the order of the logger packets is extremely important, so I buffer those packets using the ID.
We are going to try it in a real environment within a few weeks; from that I’ll check whether the synchronous receiver/sender is enough or async needs to be used.

I’m now struggling with a hardware problem on the FPGA PCB’s NIC that we have to solve :slight_smile:

I’ll keep you updated…

Some inexpensive servers have I/O bus designs that simply cannot handle that many interrupts

FWIW, we have found that in really high throughput systems, NOT using interrupts, and rather relying on polling, is far, far, far more efficient. By using polling (by “captive” IRQL PASSIVE_LEVEL threads) you also don’t have the problems of dealing with the DPC watchdog timer. The whole system runs better. You can have quite a nice “division of labor” as well… where one thread initiates requests, and a different one completes them (for example).
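To illustrate the “captive polling thread” idea, here is a minimal KMDF-flavoured sketch. The device context, register layout, status bit, and queue handling are all hypothetical; it only shows the shape of the poll-instead-of-interrupt approach, not the actual driver described here:

```c
// Minimal sketch of a "captive" PASSIVE_LEVEL polling thread in a KMDF
// driver. DEVICE_CONTEXT, StatusReg, STATUS_DATA_READY and the manual queue
// are hypothetical placeholders.
#include <ntddk.h>
#include <wdf.h>

typedef struct _DEVICE_CONTEXT {
    PULONG   StatusReg;     /* mapped device register (hypothetical)      */
    WDFQUEUE IoQueue;       /* manual queue of waiting user requests      */
    BOOLEAN  StopPolling;   /* set at device teardown                     */
} DEVICE_CONTEXT, *PDEVICE_CONTEXT;

#define STATUS_DATA_READY 0x1   /* hypothetical "new data" bit */

VOID PollingThread(_In_ PVOID Context)
{
    PDEVICE_CONTEXT devCtx = (PDEVICE_CONTEXT)Context;

    while (!devCtx->StopPolling) {

        ULONG status = READ_REGISTER_ULONG(devCtx->StatusReg);

        if (status & STATUS_DATA_READY) {
            WDFREQUEST request;
            if (NT_SUCCESS(WdfIoQueueRetrieveNextRequest(devCtx->IoQueue,
                                                         &request))) {
                /* ...copy the new data into the request's output buffer... */
                WdfRequestComplete(request, STATUS_SUCCESS);
            }
        } else {
            /* Nothing yet: short delay so one core isn't fully monopolized.
               Tune this (or spin) depending on how hot you want to run. */
            LARGE_INTEGER interval;
            interval.QuadPart = -10LL * 50;     /* 50 microseconds, relative */
            KeDelayExecutionThread(KernelMode, FALSE, &interval);
        }
    }

    PsTerminateSystemThread(STATUS_SUCCESS);
}
```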

I’m finishing work on an FPGA driver right now that’s getting about 2.25 million IOPs, end-to-end with requests coming from user mode. This is without doing anything exceptionally clever in KMDF, just using METHOD_OUT_DIRECT IOCTLs and taking requests from a parallel Queue.
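For context, METHOD_OUT_DIRECT is just a choice in the CTL_CODE; a hypothetical definition (the device type and function code are placeholders):

```c
// Hypothetical IOCTL of the kind described: METHOD_OUT_DIRECT maps the
// caller's output buffer via an MDL, so the driver can fill it without an
// extra buffered-I/O copy. Device type and function code are placeholders.
#include <winioctl.h>   /* user mode; the kernel side gets CTL_CODE from the WDK headers */

#define IOCTL_FPGA_GET_DATA \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_OUT_DIRECT, FILE_ANY_ACCESS)
```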

If I try to use interrupts, the system gets swamped and comes to a crawl.

Moral of the story: Don’t rule out polling for really high-throughput implementations. It CAN be surprisingly efficient when used for the right things.

FWIW, we have found that in really high throughput systems, NOT using interrupts, and rather relying on polling, is far, far, far more efficient.

It’s interesting you should say that. In the 3rd generation of the product I was describing, we changed from using an Ethernet fiber to using a PCIExpress board shoving the data in via DMA. We did exactly as you describe – we use polling to catch the new data instead of interrupts. It’s working quite reliably and the overhead is lower.

The 4th generation is now using USB, and we’re not quite so happy. They chose the FTDI 601 chip, and because we don’t have control over the driver, I don’t trust it. The error recovery doesn’t seem to be as good. But that’s a story for another day.