Delay in calling ReturnNetBufferListsHandler on Windows Server 2022

Hello,

I'm debugging an interesting problem which is reproducible with the tap-windows6 driver on Windows Server 2019/2022 when using iperf3. The problem doesn't reproduce with iperf2 or on Windows 10.

I have an OpenVPN tunnel between Windows and Linux and run iperf3 -c 10.8.0.1 -R -V -t 1 on the Windows machine. This doesn't really work: after a while iperf3 completes with very poor performance, on the order of tens of Kb/s. By contrast, without the "-R" flag (meaning the Windows machine sends) or on Windows 10 I get over 100 Mbit/s. Without the VPN I get 3.6 Gbit/s, but the generally poor performance of the tap-windows6 driver is outside the scope of this problem.



Here is a very simplified data flow:

iperf <-> tun interface (tap-windows6 driver) <-> openvpn (encrypt/decrypt) <-> network <-> other peer


I took a packet capture on both ends and found out that while all TCP packets arrive from Linux to Windows on the transport interface as expected, on the VPN (tun) interface a certain packet at some point arrives with a noticeable delay, on the order of one second. I added extra logging to OpenVPN, and it turns out the delay happens while writing to the tun interface. OpenVPN uses overlapped I/O: it passes an OVERLAPPED structure to WriteFile and then calls WSAWaitForMultipleEvents, including the event from the abovementioned structure. No further writes to tun happen until WSAWaitForMultipleEvents reports that the previous write has completed.
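To illustrate, here is a minimal sketch of that overlapped-write pattern, assuming a tun handle already opened with FILE_FLAG_OVERLAPPED; this is a simplified illustration of the idea, not OpenVPN's actual event loop, and error handling is trimmed:

    #include <winsock2.h>
    #include <windows.h>

    static HANDLE     tun;       /* tun device handle, assumed opened elsewhere */
    static OVERLAPPED write_ov;  /* reused for every tun write */

    static void tun_write_init(void)
    {
        /* manual-reset event, created once; completion of the pended
         * write IRP eventually signals it */
        write_ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);
    }

    static void write_to_tun(const char *pkt, DWORD len)
    {
        HANDLE ev = write_ov.hEvent;
        ZeroMemory(&write_ov, sizeof(write_ov));
        write_ov.hEvent = ev;
        ResetEvent(ev);

        /* the driver pends the IRP: WriteFile returns FALSE with
         * GetLastError() == ERROR_IO_PENDING */
        if (!WriteFile(tun, pkt, len, NULL, &write_ov) &&
            GetLastError() != ERROR_IO_PENDING) {
            /* hard failure: handle it */
        }
    }

    static void wait_for_io(WSAEVENT socket_event)
    {
        WSAEVENT events[2] = { socket_event, write_ov.hEvent };

        /* one wait services all I/O; no further tun write is issued
         * until events[1] reports that the previous one completed */
        DWORD rc = WSAWaitForMultipleEvents(2, events, FALSE,
                                            WSA_INFINITE, FALSE);
        if (rc == WSA_WAIT_EVENT_0 + 1) {
            DWORD bytes;
            GetOverlappedResult(tun, &write_ov, &bytes, FALSE);
            /* the previous tun write is done; the next may be issued */
        }
    }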

WriteFile inside OpenVPN invokes the IRP_MJ_WRITE handler in the tap-windows6 driver. This call indicates the received data to the OS. The IRP_MJ_WRITE handler does the following (yes, I know this is far from optimal, and we've developed another driver which improves performance dramatically, but that is another topic):
  1. allocates an NBL
  2. copies the packet content into the NBL
  3. marks the IRP as pending
  4. calls NdisMIndicateReceiveNetBufferLists
  5. returns STATUS_PENDING

After that, NDIS calls the ReturnNetBufferListsHandler callback (usually this happens between steps 4 and 5), where we free the NBL and complete the IRP from the previous step.
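As a rough sketch of this flow (names like TAP_ADAPTER, Adapter->NblPool, and Adapter->Mdl are placeholders assumed for illustration, not the real tap-windows6 code; buffer allocation, the payload copy, and all error handling are elided):

    NTSTATUS TapWrite(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        TAP_ADAPTER *Adapter = DeviceObject->DeviceExtension;
        PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
        ULONG len = stack->Parameters.Write.Length;

        /* 1 + 2: allocate an NBL over a driver-owned MDL and copy the
         * packet content into its buffer (copy elided) */
        PNET_BUFFER_LIST nbl = NdisAllocateNetBufferAndNetBufferList(
            Adapter->NblPool, 0, 0, Adapter->Mdl, 0, len);

        /* remember the IRP and length so ReturnNetBufferLists can
         * complete the request later */
        nbl->MiniportReserved[0] = Irp;
        nbl->MiniportReserved[1] = (PVOID)(ULONG_PTR)len;

        /* 3: completion is deferred until the NBL comes back */
        IoMarkIrpPending(Irp);

        /* 4: without NDIS_RECEIVE_FLAGS_RESOURCES, NDIS/TCPIP owns the
         * NBL from here until ReturnNetBufferLists is invoked */
        NdisMIndicateReceiveNetBufferLists(Adapter->MiniportHandle, nbl,
                                           NDIS_DEFAULT_PORT_NUMBER, 1, 0);

        /* 5: STATUS_PENDING is 0x103 = 259, the status seen in the log */
        return STATUS_PENDING;
    }

    VOID TapReturnNetBufferLists(NDIS_HANDLE MiniportAdapterContext,
                                 PNET_BUFFER_LIST NetBufferLists,
                                 ULONG ReturnFlags)
    {
        /* ownership is back with the miniport: free the NBL and
         * complete the IRP pended in TapWrite */
        PIRP irp = (PIRP)NetBufferLists->MiniportReserved[0];
        ULONG_PTR bytes = (ULONG_PTR)NetBufferLists->MiniportReserved[1];

        NdisFreeNetBufferList(NetBufferLists);

        irp->IoStatus.Status = STATUS_SUCCESS;
        irp->IoStatus.Information = bytes;
        IoCompleteRequest(irp, IO_NO_INCREMENT);
    }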

The mysterious thing is that in the iperf3 test scenario, the 8th (and following) packets of length 1442 take about a second to digest. Here is a driver log:

packet 6, so far so good
Thu Apr  7 15:22:42.639 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.640 2022 (UTC + 3:00): Enter IRP_MJ_WRITE, 1442 bytes
Thu Apr  7 15:22:42.640 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.640 2022 (UTC + 3:00): NdisMIndicateReceiveNetBufferLists()
Thu Apr  7 15:22:42.640 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.640 2022 (UTC + 3:00): Enter ReturnNetBufferListsHandler
Thu Apr  7 15:22:42.640 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.641 2022 (UTC + 3:00): IoCompleteRequest, bytecount 1456, status 0
Thu Apr  7 15:22:42.641 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.641 2022 (UTC + 3:00): Exit ReturnNetBufferListsHandler
Thu Apr  7 15:22:42.641 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.641 2022 (UTC + 3:00): Exit IRP_MJ_WRITE, status=259
Thu Apr  7 15:22:42.641 2022 (UTC + 3:00): 

packet 7, still good
Thu Apr  7 15:22:42.641 2022 (UTC + 3:00): Enter IRP_MJ_WRITE, 1442 bytes
Thu Apr  7 15:22:42.641 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.642 2022 (UTC + 3:00): NdisMIndicateReceiveNetBufferLists()
Thu Apr  7 15:22:42.642 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.642 2022 (UTC + 3:00): Enter ReturnNetBufferListsHandler
Thu Apr  7 15:22:42.642 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.642 2022 (UTC + 3:00): IoCompleteRequest, bytecount 1456, status 0
Thu Apr  7 15:22:42.642 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.642 2022 (UTC + 3:00): Exit ReturnNetBufferListsHandler
Thu Apr  7 15:22:42.642 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.643 2022 (UTC + 3:00): Exit IRP_MJ_WRITE, status=259
Thu Apr  7 15:22:42.643 2022 (UTC + 3:00): 

packet 8, here we have a problem
Thu Apr  7 15:22:42.643 2022 (UTC + 3:00): Enter IRP_MJ_WRITE, 1442 bytes
Thu Apr  7 15:22:42.643 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.648 2022 (UTC + 3:00): Packet len 1442 #8, expect delay!
Thu Apr  7 15:22:42.648 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.649 2022 (UTC + 3:00): NdisMIndicateReceiveNetBufferLists()
Thu Apr  7 15:22:42.649 2022 (UTC + 3:00): 
Thu Apr  7 15:22:42.649 2022 (UTC + 3:00): Exit IRP_MJ_WRITE, status=259
Thu Apr  7 15:22:42.649 2022 (UTC + 3:00): 
!!DELAY!!
Thu Apr  7 15:22:44.111 2022 (UTC + 3:00): Enter ReturnNetBufferListsHandler
Thu Apr  7 15:22:44.111 2022 (UTC + 3:00): 
Thu Apr  7 15:22:44.112 2022 (UTC + 3:00): IoCompleteRequest, bytecount 1456, status 0
Thu Apr  7 15:22:44.112 2022 (UTC + 3:00): 
Thu Apr  7 15:22:44.113 2022 (UTC + 3:00): Exit ReturnNetBufferListsHandler
Thu Apr  7 15:22:44.113 2022 (UTC + 3:00): 

packet 9, again delay
Thu Apr  7 15:22:44.113 2022 (UTC + 3:00): Enter IRP_MJ_WRITE, 1442 bytes
Thu Apr  7 15:22:44.113 2022 (UTC + 3:00): 
Thu Apr  7 15:22:44.114 2022 (UTC + 3:00): NdisMIndicateReceiveNetBufferLists()
Thu Apr  7 15:22:44.114 2022 (UTC + 3:00): 
Thu Apr  7 15:22:44.115 2022 (UTC + 3:00): Exit IRP_MJ_WRITE, status=259
Thu Apr  7 15:22:44.115 2022 (UTC + 3:00): 
!!HEY NDIS, WHY DOES IT TAKE SO LONG TO DIGEST THIS TCP PACKET?!!
Thu Apr  7 15:22:45.115 2022 (UTC + 3:00): Enter ReturnNetBufferListsHandler
Thu Apr  7 15:22:45.115 2022 (UTC + 3:00): 
Thu Apr  7 15:22:45.117 2022 (UTC + 3:00): IoCompleteRequest, bytecount 1456, status 0
Thu Apr  7 15:22:45.117 2022 (UTC + 3:00): 
Thu Apr  7 15:22:45.117 2022 (UTC + 3:00): Exit ReturnNetBufferListsHandler
Thu Apr  7 15:22:45.117 2022 (UTC + 3:00): 

I can work around the problem by specifying the NDIS_RECEIVE_FLAGS_RESOURCES flag when calling NdisMIndicateReceiveNetBufferLists; this allows the driver to free the NBLs immediately, complete the IRP, and return STATUS_SUCCESS. I checked that this works. However, this doesn't answer the question of why a certain packet indication takes about a second, and only on Windows Server 2019/2022; there is no such problem on Windows 10. Also, this particular driver (tap-windows6) is being retired, and I would prefer not to make a somewhat major change, especially without understanding the root cause.
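Concretely, continuing the hedged sketch from earlier, the workaround looks roughly like this inside the IRP_MJ_WRITE handler:

    /* steps 1-2 as before, but do NOT mark the IRP pending; then: */

    /* indicate with NDIS_RECEIVE_FLAGS_RESOURCES: the stack must copy
     * the data during this call, so the NBL is ours again on return */
    NdisMIndicateReceiveNetBufferLists(Adapter->MiniportHandle, nbl,
                                       NDIS_DEFAULT_PORT_NUMBER, 1,
                                       NDIS_RECEIVE_FLAGS_RESOURCES);

    /* no ReturnNetBufferLists call will come for this NBL: free it and
     * complete the IRP synchronously */
    NdisFreeNetBufferList(nbl);
    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = len;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return STATUS_SUCCESS;   /* instead of STATUS_PENDING */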



Any ideas on how I can debug this problem further, figure out what causes this delay, and get rid of it?

A few things I did:

I caught the problematic NBL (packet) in the debugger and checked its history log:

0: kd> !ndiskd.nbl ffffce83e95cfa40 -log
    Allocated
    MiniportReceived   ffffce83e95d38e0 - Netmon Lightweight Filter Driver-0000
    FilterReceived     receive in NDIS, sorting to Opens
    IndicatedToProtocol ffffce83e91ee010 - TCPIP
    ProtocolReturned   ffffce83ec5081a0 - TAP-Windows Adapter V9
    Freed 
<snip>                                                                      
    Allocated
    MiniportReceived   ffffce83e95d38e0 - Netmon Lightweight Filter Driver-0000
    FilterReceived     receive in NDIS, sorting to Opens
    IndicatedToProtocol ffffce83e91ee010 - TCPIP
    ProtocolReturned   ffffce83ec5081a0 - TAP-Windows Adapter V9
    Freed                                                                       
    Allocated

I also managed to break in during that delay and check the NBL owner:

0: kd> !ndiskd.nbl ffffce83e95cfa40 -NblCurrentOwner
    Owner              Open                ffffce83e91ee010 - TCPIP
 

Some more info about the NBL:

0: kd> !ndiskd.nbl ffffce83e95cfa40
    NBL                ffffce83e95cfa40    Next NBL           NULL
    First NB           ffffce83e95cfbc0    Source             ffffce83ec5081a0 - TAP-Windows Adapter V9
                                           Pool               ffffce83e95b78c0 - tap0901
    Flags              INDICATED, NBL_ALLOCATED, SCRATCH_1, SCRATCH_2,
                       SCRATCH_4, SCRATCH_8, MINIPORT_1, PROTOCOL_200_0
    Parent NBL         NULL                ChildRefCount      cacacaca

So if the owner of this NBL during the delay is the TCPIP protocol driver, does that mean the problem could be there? Is there something about that packet which causes a delay inside the TCP stack?

I don't know what your problem is, but the use of WSAWaitForMultipleEvents is a major red flag. It implies some kind of select loop, a design that always performs badly on Windows. Overlapped I/O is unhelpful unless you make sure that you can pend several operations concurrently.

We service multiple events with that wait call (or select/epoll, depending on the OS), such as socket read/write, tun read/write, console read, and management interface read/write. Our model is "single packet, single stream" oriented: we read a packet from the net, decrypt it, and write it to tun; we read from tun, encrypt, and write to the net, assuming the fds are ready to read/write. On Windows we don't issue another write to tun (or any other I/O operation of the same kind on the same fd) until the previous overlapped operation has completed (the associated event is signaled, as reported by WSAWaitForMultipleEvents). The same applies on other platforms.

I don't know if this is a bad design, but I can say it has been working on many platforms for the last 20 years (OpenVPN is that old). The problem appeared starting with Windows Server 2019. Yes, we assume that tun writes are fast (why would digesting packets destined for the local machine take time?), and it looks like this assumption breaks on the latest Windows Server releases. Why does it take tcpip.sys a whole second to consume a certain TCP packet?

This sounds like a classic select loop design: a single thread loops on a wait and then processes the next thing that has signaled that it needs attention. This kind of design always performs poorly on every NT-based Windows version.

That's probably not your issue; it sounds like something else is your problem. But I point it out because you are asking about a performance problem and seem to have a performance-limiting design. It depends on your hardware, of course, but 3.6 Gb/s with no crypto seems low. Any IOCP-based design (and new code should use the thread pool APIs, because they hide all of the boilerplate parts) should be able to do much better unless there is an actual network speed problem.
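For reference, a minimal sketch of what "pend several reads concurrently via the thread pool APIs" could look like (the device path and all names here are hypothetical; this is not OpenVPN code, and error handling is trimmed):

    #include <windows.h>

    #define PENDED_READS 4
    #define BUF_SIZE     2048

    typedef struct {
        OVERLAPPED ov;          /* first member, so LPOVERLAPPED maps back to the context */
        char       buf[BUF_SIZE];
    } READ_CTX;

    static HANDLE g_tun;        /* device handle, must be opened with FILE_FLAG_OVERLAPPED */
    static PTP_IO g_io;

    static void post_read(READ_CTX *ctx)
    {
        ZeroMemory(&ctx->ov, sizeof(ctx->ov));
        StartThreadpoolIo(g_io);                  /* arm the pool before each I/O */
        if (!ReadFile(g_tun, ctx->buf, BUF_SIZE, NULL, &ctx->ov) &&
            GetLastError() != ERROR_IO_PENDING) {
            CancelThreadpoolIo(g_io);             /* synchronous failure: disarm */
        }
    }

    static VOID CALLBACK on_io(PTP_CALLBACK_INSTANCE inst, PVOID param,
                               PVOID overlapped, ULONG result,
                               ULONG_PTR bytes, PTP_IO io)
    {
        READ_CTX *ctx = (READ_CTX *)overlapped;   /* ov is the first member */
        (void)inst; (void)param; (void)io;
        if (result == NO_ERROR && bytes > 0) {
            /* process ctx->buf here: encrypt and send to the peer, etc. */
        }
        post_read(ctx);                           /* immediately re-pend the read */
    }

    int main(void)
    {
        static READ_CTX ctxs[PENDED_READS];
        /* hypothetical tap device path, for illustration only */
        g_tun = CreateFileA("\\\\.\\Global\\{adapter-guid}.tap",
                            GENERIC_READ | GENERIC_WRITE, 0, NULL, OPEN_EXISTING,
                            FILE_ATTRIBUTE_SYSTEM | FILE_FLAG_OVERLAPPED, NULL);
        g_io = CreateThreadpoolIo(g_tun, on_io, NULL, NULL);
        for (int i = 0; i < PENDED_READS; i++)
            post_read(&ctxs[i]);                  /* several reads pended at once */
        Sleep(INFINITE);                          /* completions run on pool threads */
        return 0;
    }

With several buffers pended at once, the kernel always has a user buffer available to complete into, which also mitigates the receive-buffer underflow concern described next.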

Given that you are particularly concerned with the receive (to-Windows) path, one thing that you should note about this design is the potential receive-buffer underflow problem. If the receive call from user mode is not issued until after data is ready, there is only limited capacity to buffer data in kernel mode, and any thread-scheduling delay will limit the throughput, because the TCP window size will eventually have to be adjusted and the sender will slow its rate.

This is a very simplistic, high-level analysis. There are many details, and the latest versions of Windows, together with NIC hardware and drivers, have many features and optimizations that can make this flow very complex.