Windows Packet Direct / DPDK on Windows

Hi,

Are there any code examples, and a list of supported NICs, for implementing the PacketDirect Client Interface (PDCI) on Windows? The goal is to raise packet processing from the range of 100-400 Kpps per core to, potentially, 1 Mpps for a custom protocol (SCTP in this case). Our previous implementations, using C# or C++ socket calls, reached 100-400 Kpps in testing on commodity devices with back-to-back cabling.

This level of performance is required for telecoms applications.

Does anyone have experience with PacketDirect, and did it give an uplift in the capacity to handle many small packets? I’m aware of DPDK and of some musings about it being ported to and usable on Windows.

However, I’m coming up short on documentation and examples to get started.

Packet Direct is introduced here:

https://docs.microsoft.com/en-gb/windows-hardware/drivers/network/introduction-to-ndis-pdpi

I asked the same question on Stack Overflow here:

https://stackoverflow.com/questions/48291158/windows-packet-direct?noredirect=1#comment92281608_48291158

and it’s been suggested I try my luck here.

Thank you in advance for any insights and pointers in the right direction.

Best Regards,

Simon

I know nothing about PDCI or anything that you are trying to do, but I do know that a trivial UM application can saturate a 10 Gb/s link with UDP traffic without resorting to any special tricks.

What threading / IO model have you used in your C# (clearly UM) or C++ (UM or KM) code?

How much context does your protocol need for responding to packets? Can each one be processed independently or do they need to be grouped so that certain ones are processed in a particular order?

________________________________
From: SimonH
Sent: Saturday, October 6, 2018 6:41:14 AM
To: MBond
Subject: [NTDEV] Windows Packet Direct / DPDK on Windows


So, after reading up on this interface for about 5 minutes, I can tell you that it belongs to the class of stuff you should never touch unless you cannot possibly avoid it.

PDCI is an interface designed principally to address the needs of VoIP, video calling, and other latency- or jitter-sensitive applications on servers dedicated to handling this traffic. Windows is fundamentally unsuited to handling this kind of traffic, so don’t waste your time. Where real-time operating systems are not used, various *nix modifications are made, such as dedicating cores to polling hardware, running tight loops indefinitely, etc. There is no way to do anything close to this in Windows. But if your goal is not ‘packet to packet latency consistency’ or ‘packet to packet latency minimization’ but rather ‘bulk throughput increase’, then there is a lot that Windows can do for you.

‘Packet to packet latency consistency’ is important to systems that forward VoIP traffic. In some sense it doesn’t matter how quickly the traffic is forwarded since, within reason, the end user experience does not degrade with an extra delay; it does tangibly degrade with any variation in that delay. Many other types of traffic have this pattern too, including TFTP and many naïve TCP applications.

‘Packet to packet latency minimization’ is important to high frequency trading applications. In this arena, the most important thing is to get the packet out as fast as possible: the sooner you do, the better the likelihood that you will be ‘first in the book’ at the exchange, thus ensuring preferential execution at a given price level and probably a better opportunity to make money on a given transaction. Many other traffic types have this pattern too, including many military applications.

‘Bulk throughput’ refers to getting the most productive work possible out of a given hardware platform. In this sense, it is the most general of the objectives mentioned so far (another, which I don’t cover here, is resource use minimization), and it is by far the best one that you can hope to achieve using Windows and commodity hardware. If your objective falls into this category, or can be recast so that it does, then you are in a place where you can possibly succeed.
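To make the three objectives concrete, here is a minimal Python sketch that computes one number for each from a list of per-packet latencies. The function name, the sample values, and the choice of standard deviation as the jitter measure are all assumptions for illustration; other jitter definitions are in common use.

```python
import statistics

def summarize(latencies_us, window_s):
    """Summarize per-packet latencies (microseconds) observed over window_s seconds."""
    return {
        # 'latency minimization' wants this number to be small
        "mean_latency_us": statistics.mean(latencies_us),
        # 'latency consistency' wants this number to be small
        "jitter_us": statistics.pstdev(latencies_us),
        # 'bulk throughput' wants this number to be large
        "throughput_pps": len(latencies_us) / window_s,
    }

# Hypothetical samples: one path is slow but steady, the other fast but erratic.
steady = [500, 500, 500, 500]
erratic = [50, 400, 60, 350]

if __name__ == "__main__":
    print("steady :", summarize(steady, window_s=1.0))
    print("erratic:", summarize(erratic, window_s=1.0))
```

In this made-up data the steady path has the higher mean latency but zero jitter, which is the trade a VoIP forwarder would prefer, while the erratic path has the lower mean that a latency-minimizing application chases.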

In order to provide more than this very general advice I will need to wait for your response. My participation on this forum is highly variable, although Peter has recently informed me that I have over 800 posts (just to make me feel old, I think), but I will try to watch for messages from you and help as I can.

From: Marion Bond
Sent: Sunday, October 7, 2018 7:10:00 PM
To: SimonH
Subject: RE: [NTDEV] Windows Packet Direct / DPDK on Windows


Thank you for the lengthy responses :) Saturation of the link is indeed possible with a UM application if you use large packets. For the application(s) we have in use and in mind, the traffic pattern is many tiny packets.
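As a rough check on why packet size matters so much here, the sketch below computes the maximum frame rate of a link for a given Ethernet frame size, and the time budget per packet at a target rate. It assumes standard Ethernet framing (20 bytes of preamble, start-of-frame delimiter and inter-frame gap per frame); the function names are made up for illustration.

```python
# Per-frame overhead on the wire that is not part of the Ethernet frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20

def line_rate_pps(link_bps, frame_bytes):
    """Maximum frames per second a link can carry at a given frame size.

    frame_bytes is the Ethernet frame size including the 4-byte FCS
    (64 is the minimum, 1518 the usual maximum without VLAN tags).
    """
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire

def ns_per_packet(pps):
    """Average time available to handle one packet at a given packet rate."""
    return 1e9 / pps

if __name__ == "__main__":
    for size in (64, 1518):
        print(f"{size:>5} B frames: {line_rate_pps(10e9, size):,.0f} pps at 10 Gb/s")
    for rate in (400_000, 1_000_000):
        print(f"{rate:>9,} pps: {ns_per_packet(rate):,.0f} ns per packet")
```

At 10 Gb/s this gives roughly 14.88 Mpps for minimum-size frames but only about 0.81 Mpps for 1518-byte frames, so a sender can fill the link with large packets at well under 1 Mpps. A 1 Mpps per-core target leaves about 1 µs per packet.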

Currently we use an abandoned SCTP kernel driver by Bruce Cran, which has served us well. However, when we rewrite SCTP in C# we arrive at similar performance, in packets per second, to the SCTP kernel driver.

I take your point on jitter and latency on Windows with scheduler time slices. It is well known to us from earlier endeavours.

However, the various presentations from Microsoft on DPDK and PDCI that we found seemed to indicate a reduction in latency and jitter, together with an increase in pps, for UM apps using kernel bypass. That seemed to be exactly what we were looking for, i.e. UM apps and custom UM protocol(s) with improved pps, while avoiding driver work. What we struggled to do was find an example to build and test, to see whether it was a fit for our plans, or even whether our plans are feasible on the current platform.

We are looking to build a new base with custom telecom protocols, onto which we can then add service layers such as GSM call control, SS7 firewall, credit balance check, fraud prevention, etc. The services require differing performance: some need 100 pps and millisecond latency, some 10 kpps, and some sub-millisecond latency and millions of pps (e.g. SS7 firewall, high-volume call control). The more performance we can build into the core, the wider the range of services we can build on top of it.

We are at the stage, and have the time and resource, to decide which routes are available and which of them would be the right one for us. C# in UM plus some sort of magical low-latency, low-jitter, high-throughput packet interface is, of course, the preferred route. Driver work and downright horrible low-level APIs are also possible, and so is a shift to Linux.

Our in-house experience and tooling has been Windows and C# for the last 10 years or so, but we started with assembly, Pascal, C, and C++ a long, long time ago.

The struggle we have at the moment is evaluating one possible route, i.e. Microsoft’s musings regarding a DPDK port or PDCI, given the lack of documentation and examples. We’d like to do that before taking the larger, more involved steps of lower-level languages, driver development, or a shift to Linux, each of which has an associated cost for us in gaining experience.

We wanted to evaluate what we perceive as the easiest route for us, given our infrastructure and knowledge, and wondered if anyone had been ahead of us and could point the way to something we could test.

We tried Stack Exchange, but when we get into these sorts of realms the pool of those who need such things tends to shrink somewhat. Fast packet processing on Windows seems to be rather new, relatively speaking.

Best Regards,

Simon

PDCI only works if you have a NIC + NIC driver that support PD. There really aren’t very many of these available, and Microsoft is not generally going to be expanding PD support right now.

Intel and the DPDK project are working on DPDK support for Windows. This has most recently been demoed here https://www.youtube.com/watch?v=2kj4dkvCRRw (around timestamp 36:00).

DPDK-on-Windows is still pretty bleeding-edge; the patches are still in a separate experimental repository. And even on other OSes, DPDK itself is generally bleedy-edgy too; it’s not at all a complete solution, but rather a box of tools that you can bang on to cobble together a solution. (The case study mentioned in the video, for example, is that Cisco used DPDK to build a high-performance video broadcasting solution.)

I can’t say what’s right for you and your solutions, but I can give very general guidance.

In general, the Windows sockets API is very fast and low-latency. We spend a lot of time tuning it in every release. However, for various reasons, there’s more than one way to use the sockets API, and some ways are substantially worse than others. So there’s almost always low-hanging fruit to be gained just by optimizing your use of the sockets API. E.g., use async + IO completion ports, ensure you’re posting sufficiently deep buffers, fiddle with the number of threads doing IO and their priorities, and consider using RIO.
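One rough way to think about "sufficiently deep buffers" is Little's law: the number of receives that need to be outstanding is about the arrival rate multiplied by the time a buffer spends between being posted and being reposted. The sketch below is only that arithmetic; the turnaround time and headroom factor are assumed inputs you would have to measure or choose yourself, and the function name is hypothetical.

```python
import math

def min_outstanding_receives(packets_per_sec, turnaround_us, headroom=2.0):
    """Estimate how many receive buffers should be posted at any moment.

    packets_per_sec: expected arrival rate on this socket or queue.
    turnaround_us:   assumed time (microseconds) from a buffer being filled
                     to the application reposting it.
    headroom:        assumed safety factor to absorb bursts and scheduling delays.
    """
    in_flight = packets_per_sec * turnaround_us * headroom / 1_000_000
    return math.ceil(in_flight)

if __name__ == "__main__":
    for rate in (100_000, 400_000, 1_000_000):
        depth = min_outstanding_receives(rate, turnaround_us=50)
        print(f"{rate:>9,} pps -> post at least {depth} receives")
```

If fewer receives than this are posted, arriving packets have to wait in (or overflow) lower-level queues while the application catches up, whatever IO model is used.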

If you’re using middleware (like .NET), make sure it’s not doing anything silly, and replace it if necessary. For example, .NET Core is these days getting faster than .NET Framework, but the fastest applications on .NET Core still p/invoke out to native code like libuv.

If you’ve truly exhausted your options with usermode Winsock, you may consider moving parts of your application into kernel. E.g., Windows’s own SMB and HTTP stacks use WSK, so they have much tighter control over threading and IO.

If you’re really, really needing that last mile of performance, though, you can completely bypass the operating system with DPDK. DPDK’s nigh-unbeatable performance edge comes about because it takes over the hardware and operates it directly from your application; the OS has no visibility or control over what you’re doing. The disadvantage is that (a) DPDK-on-Windows is still very new, so is quite raw, and (b) DPDK is way more work. Since you’re not getting any help from your OS, you have to do everything yourself; you can’t just open a socket and recv a packet.

(Wow, today I learned that mentioning a youtube URL causes this forum software to embed the video directly. Sorry about that.)

Right

So based on this I can read that your goal is to provide a solution to these problems at a tangibly lower cost base than pre-existing solutions. Hence the emphasis on Windows as a commodity OS and C# as a commodity language. There is nothing wrong with this as a business model, but it will bear directly on the technology choices that you should make.

Assuming that I am approximately correct about your objectives, you should focus on your service layers first. You can implement them all with the built-in thread pool APIs (either natively or in managed code), and the way that you implement them will have a much more profound impact on your overall performance than anything that you can do at a lower level. Understanding how to break up the work of a given request into an effective

@“Jeffrey_Tippet_[MSFT]” said:
(Wow, today I learned that mentioning a youtube URL causes this forum software to embed the video directly. Sorry about that.)

I just learned that too. Don’t be sorry! I think it’s kinda cool.

Newfangled inventions…

Peter

Many thanks for the replies, always much appreciated.

I can say with reasonable certainty that after about 5 years we have likely exhausted what existing Windows sockets can do for us. We’ve tried completion ports, RIO, et al. As each new technology came out we spent a fair amount of time testing the network throughput.

In an attempt to establish a high-water mark for theoretical performance, some time back we altered the current NDIS driver sample to send out the same minimum-size packets and counted them at the other end. We went through many iterations, setups, and tests over the years. The net result is, as now seems obvious, that you don’t get DPDK-like performance without DPDK or similar. I remember, I think, a slideshow from Microsoft estimating circa 700 kpps for RIO.

The purpose of the question was to determine whether there is a known, accessible way to harness DPDK, Packet Direct, etc. in order to enter a different realm of network performance on Windows, i.e. use a UM application and call an API to attain this. The answer, from our experience and the replies so far, appears to be that there isn’t, at least not yet, and who knows when or if there might be.

We have had the service layers mentioned for approx. 4 years and they work well, but the currently attainable network performance precludes deployment of some of them. The others provide a steady revenue stream. Our customers want network-equipment-like performance for some of the functions.

Given that we’re unlikely to be able to test anything on the Windows side, I think we’ll forgo that idea and spend some time evaluating DPDK on Linux, along with our own capability in that area.

Once again, many thanks for all the posts.

Best Regards,

Simon

@SimonH Curious, which path did you end up choosing? We’re thinking about going the DPDK/Linux route, but there isn’t a big enough community yet, so you have to learn and experiment to get it right. A couple of other downsides to DPDK are that it takes over the entire NIC, so you either need multiple NICs or have to use KNI. Also, the only way to get packet notifications is to poll constantly, so the cores you use for network comms are always at 100%. This is fine if it’s a dedicated machine assigned one job, but probably not ideal for a machine with multiple roles.
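The constant-polling point can be illustrated with a toy model of a poll-mode receive loop. This is plain Python with a deque standing in for a NIC receive ring, not DPDK code; in a real DPDK application the burst call would be `rte_eth_rx_burst`, and the names, burst size and iteration count below are assumptions for illustration only.

```python
from collections import deque

BURST_SIZE = 32  # hypothetical burst size, chosen only for this sketch

def rx_burst(ring, max_packets):
    """Take up to max_packets from the ring without blocking; may return []."""
    burst = []
    while ring and len(burst) < max_packets:
        burst.append(ring.popleft())
    return burst

def poll_loop(ring, iterations):
    """Spin for a fixed number of iterations, counting handled packets and empty polls."""
    handled = 0
    empty_polls = 0
    for _ in range(iterations):
        burst = rx_burst(ring, BURST_SIZE)
        if not burst:
            # Nothing arrived, but the core still spent the cycle checking.
            empty_polls += 1
            continue
        handled += len(burst)  # a real application would process each packet here
    return handled, empty_polls

if __name__ == "__main__":
    ring = deque(range(100))  # 100 pretend packets already queued
    handled, empty_polls = poll_loop(ring, iterations=1000)
    print(f"handled={handled} empty_polls={empty_polls}")
```

With 100 queued packets and a burst size of 32, the loop does 4 useful polls and 996 empty ones. The loop never sleeps or waits for an interrupt, which is why a core running it shows as fully busy regardless of traffic.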