The typical causes of low TCP throughput are:
- packet loss
- failure to push data into the stream quickly enough on the sending end
- failure to pull data out of the stream quickly enough on the receiving end
The TCP window size changes depending on the timing of how the data is pushed into / pulled out of the stream on each end and can give you an indication of a problem, but is unlikely to be a root cause.
Assuming that you do not have packet loss, you need to look at how you are reading / writing to the socket. This is supported by the fact that simple test programs can achieve the desired throughput, and that the change in timing / driver topology caused by your hypervisor has a significant impact.
Based on the fact that you cite wait functions, you probably have a serial IO model. Your use of TCP no-delay and jumbo frames further suggests this. Ultimately this IO model will be your bottleneck, but it may not be obvious in your VM environment depending on the implementation of the NIC driver for the virtual NIC and the hypervisor switch.
To check, you should set up a perf counter to track your 'call rate' for sending and your 'pending buffer' size for reading. Simple interlocked counters can track these, and then you can sample the results and track trends.
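Something like this rough sketch is all I mean -- the names and layout (IO_STATS, StatsOnSend and so on) are made up, so adjust it to fit your driver:

#include <ntddk.h>

typedef struct _IO_STATS {
    volatile LONG   SendCallCount;      /* sends issued since the last sample */
    volatile LONG64 PendingBufferBytes; /* bytes queued but not yet consumed  */
} IO_STATS;

static IO_STATS g_Stats;

/* Call on every send you issue. */
VOID StatsOnSend(VOID)
{
    InterlockedIncrement(&g_Stats.SendCallCount);
}

/* Call when received data is queued for the consumer. */
VOID StatsOnDataQueued(LONG64 Bytes)
{
    InterlockedExchangeAdd64(&g_Stats.PendingBufferBytes, Bytes);
}

/* Call when the consumer drains data out of the queue. */
VOID StatsOnDataDrained(LONG64 Bytes)
{
    InterlockedExchangeAdd64(&g_Stats.PendingBufferBytes, -Bytes);
}

/* Sample from a periodic timer DPC, say once a second, and watch the trend. */
VOID StatsSample(VOID)
{
    LONG   calls   = InterlockedExchange(&g_Stats.SendCallCount, 0);
    LONG64 pending = g_Stats.PendingBufferBytes;  /* snapshot is good enough */

    DbgPrint("send calls/interval=%ld pending bytes=%I64d\n", calls, pending);
}

If the send call rate drops or the pending receive bytes climb whenever your other work runs, the IO model is the problem rather than the network.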
Assuming that I am even close to right, you will want to change to a completion-based, deserialized design.
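By that I mean keeping several sends in flight and doing the bookkeeping in the IRP completion routine, rather than blocking on an event after each send. Very roughly -- SEND_CONTEXT and PostNextSend are stand-ins for your own buffer management, not anything from the WSK API:

#include <ntddk.h>
#include <wsk.h>

/* Hypothetical per-send bookkeeping; replace with whatever your driver
   already tracks for each buffer / MDL. */
typedef struct _SEND_CONTEXT {
    PWSK_SOCKET Socket;
    WSK_BUF     WskBuf;    /* Mdl + Offset + Length of this chunk */
} SEND_CONTEXT, *PSEND_CONTEXT;

/* Your code: pick the next chunk and call PostSend for it (hypothetical). */
VOID PostNextSend(PSEND_CONTEXT Ctx);

static NTSTATUS
SendCompletion(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    PSEND_CONTEXT ctx = (PSEND_CONTEXT)Context;

    UNREFERENCED_PARAMETER(DeviceObject);

    /* Irp->IoStatus.Information is the number of bytes actually sent.
       Recycle the buffer and post the next chunk immediately so the
       connection never sits idle waiting on you. */
    PostNextSend(ctx);

    IoFreeIrp(Irp);
    return STATUS_MORE_PROCESSING_REQUIRED;  /* we own this IRP */
}

NTSTATUS
PostSend(PSEND_CONTEXT Ctx)
{
    const WSK_PROVIDER_CONNECTION_DISPATCH *dispatch =
        (const WSK_PROVIDER_CONNECTION_DISPATCH *)Ctx->Socket->Dispatch;
    PIRP irp = IoAllocateIrp(1, FALSE);

    if (irp == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    IoSetCompletionRoutine(irp, SendCompletion, Ctx, TRUE, TRUE, TRUE);

    /* This will normally return STATUS_PENDING; the real work continues
       in SendCompletion, not here. */
    return dispatch->WskSend(Ctx->Socket, &Ctx->WskBuf, 0, irp);
}

The same idea applies on the receive side: always keep a receive posted (or use the WSK receive event callback) so the stack has somewhere to put incoming data without waiting on you.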
From: xxxxx@gmail.com
Sent: Wednesday, December 24, 2014 8:49 AM
To: Windows System Software Devs Interest List
Thanks Alex and Maxim for replying.
Alex
- A similar program in user mode gives high throughput provided we keep the data flowing uninterrupted; a Sleep or WaitForXXX primitive slows it down significantly. We ran netperf, which consumes about 80-85% of the network bandwidth. A sample kernel-mode program also gives good bandwidth provided we don't wait on any responses or do any other work. But the user-mode program and the kernel-mode driver are simplistic emulations of what we are trying to do and don't capture the exact code flow.
- A network capture shows that the TCP window becomes small on the real hardware, whereas it remains at 64K for the virtual machines. We tried to boost this using the SO_SNDBUF and SO_RCVBUF settings, but that didn't help either.
- Nagle is disabled, as we set WSK_FLAG_NO_DELAY. Wireshark shows that the PSH flag is set in the TCP header.
Maxim
Tried these too. The performance is even worse than with the Intel cards. However, this might be because the Windows drivers for the Broadcom cards don't respect jumbo frame settings higher than 1500; if we set anything higher, the capture shows only TCP packets of 546 bytes flowing through. On the Intel cards we have the jumbo frame size set to ~4K.
Again, thanks for replying guys. Really appreciate it. 