I’m having trouble receiving many multicast UDP streams simultaneously (over 50, at 5-6 Mbps each). With the standard Windows sockets approach I lose a lot of packets, so sometimes I can’t decode the streams (they are MPEG-2 TS streams). I think this happens because of the large number of kernel-mode/user-mode (KM/UM) transitions, since each WSARecv call returns only one UDP datagram. So I guess that if I could gather packets in KM, I’d get a performance boost. I have some experience writing KM drivers, but I have never written a networking driver. I found that Microsoft suggests WFP (Windows Filtering Platform) for similar purposes, but I’m not sure I’ve understood that correctly. So I want to ask you, dear experts: where should I start? I’d appreciate any suggestions.
Before you rush off to kernel mode… the usermode APIs can scale VERY well, easily to 50 simultaneous UDP streams of 5-6Mbps each. But it’s not obvious how to get the best scalability out of winsock. You need to set a few magic flags, know which APIs are “bad” and which are “good”, etc.
Kernel is not always faster, and kernel is definitely more difficult. I see a 10x reduction in developer productivity when I have to do something like this in kernel versus usermode.
So I suggest you give usermode winsock another chance. If you can get it working in usermode, that’ll save you lots of headaches in the long run.
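To make the "stay in usermode" idea a bit more concrete, here is a minimal, portable Python sketch of two generic techniques: asking the OS for a larger socket receive buffer, and draining every queued datagram each time the socket becomes readable. This is only an illustration. The post doesn't say which flags or APIs it has in mind, and this is not how ctsTraffic is implemented. The function names (`open_udp_receiver`, `join_group`, `drain`) and the 4 MB buffer size are made up for the example.

```python
import socket
import struct


def open_udp_receiver(port, bind_addr="", rcvbuf_bytes=4 * 1024 * 1024):
    """Create a non-blocking UDP socket with a larger receive buffer.

    rcvbuf_bytes is an illustrative value; the OS may cap or adjust it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Let several receivers share the port (common for multicast).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # A bigger kernel-side queue absorbs bursts while the app is busy.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    sock.bind((bind_addr, port))
    sock.setblocking(False)
    return sock


def join_group(sock, group, iface="0.0.0.0"):
    """Join an IPv4 multicast group on the given local interface address."""
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)


def drain(sock, max_datagrams=64, bufsize=65535):
    """Read up to max_datagrams queued datagrams without blocking."""
    datagrams = []
    for _ in range(max_datagrams):
        try:
            data, _addr = sock.recvfrom(bufsize)
        except BlockingIOError:
            break  # queue is empty for now
        datagrams.append(data)
    return datagrams
```

A receive loop would wait for readability (for example with `select.select`) and then call `drain` once per wakeup. Each `recvfrom` still returns a single datagram, so this doesn't reduce the number of calls, but a deeper buffer gives the application more slack before the kernel starts dropping packets.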
Take a look at http://ctstraffic.codeplex.com/. This is a sample written by a winsock guru on the Windows Networking team. The tool scales really well. He’s gotten this thing to scale to half a million simultaneous connections to a single server. (Since TCP and UDP only provide about 65,000 local port numbers per IP address, he had to start assigning a bunch of local IP addresses, just to get enough port numbers to go around.)
You can use ctsTraffic as a proof-of-concept – if that usermode tool can scale to 70 UDP streams of 10Mbps each without significant packet loss, then it’s clearly possible to tackle your scenario in usermode. Once you’re satisfied it’s possible, then you can start borrowing ideas and code from the winsock sample until your app reaches its scalability goals.
Here’s an example usage of ctsTraffic that runs 70 simultaneous UDP streams of 10Mbps each, sending 100 UDP datagrams per second per connection:
Server:\> ctsTraffic.exe -listen:* -protocol:udp -verify:connection -bitspersecond:10000000 -framerate:100 -streamlength:60
Client:\> ctsTraffic.exe -target:yourservername -connections:70 -iterations:1 -protocol:udp -verify:connection -bitspersecond:10000000 -framerate:100 -streamlength:60 -bufferdepth:5
The workstations sitting on my desk are connected via a 1Gbps Ethernet link. When I run those commands, the server uses about 20% CPU, the client uses about 15% CPU, and they push a steady 720Mbps between them, with a datagram loss rate of about 0.02%. It’s unlikely that you can improve on that by going to kernel mode.
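The numbers in that test can be sanity-checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation only. The helper names are made up, and the 1316-byte payload (seven 188-byte TS packets per datagram) is an assumption about the original poster's streams that the thread doesn't confirm.

```python
def aggregate_bps(streams, bps_per_stream):
    """Total payload rate across all streams, in bits per second."""
    return streams * bps_per_stream


def bytes_per_frame(bps_per_stream, frames_per_second):
    """Payload bytes carried by each frame of one stream."""
    return bps_per_stream / frames_per_second / 8


def datagrams_per_second(streams, bps_per_stream, payload_bytes):
    """Total datagram rate if every datagram carries payload_bytes."""
    return streams * bps_per_stream / (payload_bytes * 8)


# The ctsTraffic test above: 70 streams, 10 Mbps each, 100 frames/s.
print(aggregate_bps(70, 10_000_000))     # 700000000 bits/s of payload
print(bytes_per_frame(10_000_000, 100))  # 12500.0 bytes per frame

# The original scenario: 50 streams at up to 6 Mbps each.
print(aggregate_bps(50, 6_000_000))      # 300000000 bits/s of payload
# ASSUMPTION: 1316-byte datagrams (7 x 188-byte TS packets).
print(datagrams_per_second(50, 6_000_000, 1316))
```

The test moves about 700Mbps of payload, which is in line with the measured 720Mbps once some overhead is included (the post doesn't say how that figure was measured). The original scenario needs well under half that bandwidth, but if its datagrams really are that small, it would see far more datagrams per second than the test's 7,000 frames per second, so it's worth rerunning ctsTraffic with settings closer to the real streams.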