Unable to achieve high network bandwidth using Winsock Kernel

We are working on a failover cluster solution for Windows and have written a kernel-mode driver using Winsock Kernel (WSK) to ship I/Os to a remote host. However, we are getting very low transfer bandwidth (50 MBps) across the network on real hardware. If we run the same solution on virtual machines connected by the same backend network, we see about 10 times the speed (500 MBps). The only glaring difference between the setups is that the VMs use a virtualized network card (over a real Broadcom NetXtreme BCM57800 10 gigabit) whereas the physical machines use Intel 10G network cards. Interestingly, the latency for a single I/O is about the same on both the physical and virtual setups (280 microseconds).
A typical packet flow for our driver over the network looks like this:
Comp1 ----> Comp2 (Read request for 4K, carried in a 64-byte header)
Comp2 ----> Comp1 (Read response: 64-byte header + 4K bytes of data)
The machines run Windows Server 2012 R2.
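
As a rough sanity check on the numbers above, the following sketch applies Little's law to the request/response flow: if each 4K read takes about 280 microseconds end to end, throughput depends on how many reads are kept in flight at once. The function names are hypothetical, 1 MB is assumed to mean 10^6 bytes, and link-rate limits and header overhead are ignored, so treat the results as estimates only.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated payload bytes per second when `outstanding` requests of
 * `payload_bytes` each are kept in flight and every request completes
 * in `latency_us` microseconds (Little's law; link limits ignored). */
static double pipelined_bytes_per_sec(uint32_t payload_bytes,
                                      double latency_us,
                                      uint32_t outstanding)
{
    assert(latency_us > 0.0);
    return (double)payload_bytes * (double)outstanding * 1e6 / latency_us;
}

/* Smallest number of in-flight requests needed to reach
 * `target_bytes_per_sec` under the same assumptions. */
static uint32_t required_outstanding(uint32_t payload_bytes,
                                     double latency_us,
                                     double target_bytes_per_sec)
{
    assert(payload_bytes > 0 && latency_us > 0.0);
    double per_request = (double)payload_bytes * 1e6 / latency_us;
    uint32_t n = (uint32_t)(target_bytes_per_sec / per_request);
    if ((double)n * per_request < target_bytes_per_sec) {
        n++; /* round up to a whole request */
    }
    return n;
}
```

Under these assumptions, `pipelined_bytes_per_sec(4096, 280.0, 1)` comes to roughly 14.6 MB/s, and `required_outstanding` gives 4 in-flight reads for 50 MBps and 35 for 500 MBps. If that model holds, comparing the effective queue depth on the two setups may be worth doing.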

These are the things we tried:

  1. Compared the adapter and TCP settings of both setups. Except for some custom entries on the virtualized adapter, we couldn't find many differences. We copied all the settings (network adapter and TCP) from the virtual setup to the physical setup.
  2. Analyzing with Wireshark, we found that the virtual machines achieve a larger TCP window size than the physical machines. Thinking this might be a result of the send and receive buffer sizes, we changed both to 2 MB programmatically on the physical machines. This improved things a bit, but we still couldn't go beyond ~55 MBps.
  3. Tried the various TCP settings from http://www.speedguide.net/articles/windows-7-vista-2008-tweaks-2574 with no luck.
  4. We are using the WskReceiveEvent callback for receives, since Microsoft recommends it.

Please excuse us if the questions sound very basic. We don't have much Windows kernel-mode networking background to begin with and would appreciate any help or pointers.

If you run a similar program in user mode, what throughput do you get?

Do you see any problems in the network capture?

Does your request have a PUSH bit set in the header? Is Nagle enabled on the socket?

>real Broadcom NetXtreme BCM57800 10 gigabit) whereas the physical machines use Intel 10G network cards.

Try the physical machines with the Broadcom NetXtreme BCM57800 10 gigabit cards.

What will the speed be?


Maxim S. Shatskih
Microsoft MVP on File System And Storage

Thanks Alex and Maxim for replying.

  1. A similar program in user mode gives high throughput, provided we keep the data flowing uninterrupted. A sleep or a WaitForXXX primitive slows it down significantly. We ran netperf, which consumes about 80-85% of the network bandwidth. A sample kernel-mode program also gives good bandwidth, provided we don't wait on any responses or do any other work. But the user-mode program and the sample kernel-mode driver are simplistic emulations of what we are trying to do and don't capture the exact code flow.
  2. The network capture shows that the TCP window becomes small on the real hardware, whereas it remains at 64K on the virtual machines. We tried to boost it using the SO_SNDBUF and SO_RCVBUF settings, but this didn't help either.
  3. Nagle is disabled, as we set WSK_FLAG_NODELAY. Wireshark shows that the PSH flag is set in the TCP header.
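
To put the window observation in context, here is a small sketch of the usual window and bandwidth-delay arithmetic. It assumes the 280 microsecond I/O latency can stand in for the TCP round-trip time, which is only an approximation, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on throughput (bytes/s) for one TCP connection that can
 * have at most `window_bytes` unacknowledged per round trip. */
static double window_limited_bytes_per_sec(uint32_t window_bytes,
                                           double rtt_us)
{
    assert(rtt_us > 0.0);
    return (double)window_bytes * 1e6 / rtt_us;
}

/* Bandwidth-delay product: bytes that must be in flight to keep a link
 * of `link_bits_per_sec` busy at the given round-trip time. */
static uint64_t bandwidth_delay_product_bytes(double link_bits_per_sec,
                                              double rtt_us)
{
    assert(link_bits_per_sec >= 0.0 && rtt_us >= 0.0);
    return (uint64_t)(link_bits_per_sec / 8.0 * rtt_us / 1e6);
}
```

With these assumptions, a 64K window at 280 microseconds caps a single connection at roughly 234 MB/s, and filling a 10 gigabit link would need about 350,000 bytes in flight. A window that shrinks well below 64K would lower the cap proportionally, which seems consistent with what the capture shows on the physical machines.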

We tried the Broadcom cards on the physical machines too. The performance is even worse than with the Intel cards. However, this might be because the Windows drivers for the Broadcom cards don't respect Jumbo Frame settings higher than 1500. If we set anything higher, the TCP capture shows only packets of 546 bytes flowing through. On the Intel cards we have Jumbo Frames set to ~4K.

Again, thanks for replying guys. Really appreciate it.

You may not be posting enough receive buffers. Are you enabling WskReceiveEvent, or using WskReceive?

Actually, I am using the same code that I use on the virtual machines. If that were the cause, I should have seen the problem on both setups. We have tried both the WskReceiveEvent callback and WskReceive. The former gave slightly better performance.