Unusual behavior with NDIS6 and Hyper-V network adapter

Recently I discovered an issue using our NDIS6 protocol driver with Hyper-V network adapters. The issue is unique to this adapter; I don’t see the problem anywhere else. Our protocol driver has run on VMware, VirtualBox, and numerous physical adapters without an issue. I’m hoping someone here will have some insight. I’ll try to put this in a nutshell and expand later.
.
To send network packets, the user builds the packet in user space and passes it to the driver in a structure via an IOCTL call. From there we do the typical calls to:
IoAllocateMdl()
MmProbeAndLockPages()
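Roughly, that part of the IOCTL handler looks like this (simplified sketch; pUserPacket and PacketLength are placeholders for whatever the ioctl input actually carries):

    // Sketch of the lock-down step for the user buffer.
    PMDL mdl = IoAllocateMdl(pUserPacket, PacketLength, FALSE, FALSE, NULL);
    if (mdl == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    __try {
        // Pin the user pages so they stay valid while NDIS owns the buffer.
        MmProbeAndLockPages(mdl, UserMode, IoReadAccess);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(mdl);
        return GetExceptionCode();
    }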
Creating the NET_BUFFER_LIST is where it gets interesting.
.
Send Method 1:
I can use the MDL created from the locked user-space memory with NdisAllocateNetBufferAndNetBufferList(), call NdisSendNetBufferLists(), and then release the memory in the SendNetBufferListsCompleteHandler. This works fine with Hyper-V.
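The real code differs in the details, but the shape of method 1 is roughly this (hPool is an NBL pool handle from NdisAllocateNetBufferListPool and hBinding is the binding handle from NdisOpenAdapterEx; both names are placeholders):

    // Method 1 sketch: wrap the locked user-space MDL directly in a NET_BUFFER_LIST.
    PNET_BUFFER_LIST nbl = NdisAllocateNetBufferAndNetBufferList(
        hPool,
        0, 0,              // no NBL context
        mdl,               // MDL chain over the locked user pages
        0,                 // data offset
        PacketLength);     // data length
    if (nbl == NULL) {
        MmUnlockPages(mdl);
        IoFreeMdl(mdl);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    nbl->ProtocolReserved[0] = Irp;    // so the completion handler can finish the ioctl
    IoMarkIrpPending(Irp);

    NdisSendNetBufferLists(hBinding, nbl, NDIS_DEFAULT_PORT_NUMBER, 0);
    // The user pages and the NBL are released later, in SendNetBufferListsComplete.
    return STATUS_PENDING;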
.
Send Method 2:
I can make a copy of the user-space buffer using NdisAllocateMemoryWithTagPriority() and NdisAllocateMdl(), release the user-space memory, and use the new MDL with NdisAllocateNetBufferAndNetBufferList() and NdisSendNetBufferLists(). This does not work with Hyper-V.
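Again in rough outline (the pool tag and names are placeholders, not the actual code):

    // Method 2 sketch: copy the payload into nonpaged kernel memory, build a
    // fresh MDL for it, and drop the user pages immediately.
    PVOID userVa = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
    PVOID kbuf   = NdisAllocateMemoryWithTagPriority(hBinding, PacketLength,
                                                     'dnsP', NormalPoolPriority);
    PMDL  kmdl   = NULL;

    if (userVa != NULL && kbuf != NULL) {
        RtlCopyMemory(kbuf, userVa, PacketLength);
        kmdl = NdisAllocateMdl(hBinding, kbuf, PacketLength);
    }

    MmUnlockPages(mdl);        // user pages no longer needed
    IoFreeMdl(mdl);

    if (kmdl == NULL) {
        if (kbuf != NULL) {
            NdisFreeMemoryWithTagPriority(hBinding, kbuf, 'dnsP');
        }
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    // From here the NET_BUFFER_LIST is built and sent exactly as in method 1,
    // except that kmdl/kbuf are what get freed in the completion handler.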
.
Why make a copy? The send performance can be significantly faster using kernel memory instead of user-space memory (up to 20% faster). Perhaps someone knows more about this phenomenon, since it’s counterintuitive. On some systems the copy makes no difference.
.
What happens when it does not work? It appears that the NET_BUFFER_LIST gets stuck (or lost) during the send, and the SendNetBufferListsCompleteHandler is never called. I’m puzzled by this behavior. If I create a Hyper-V legacy adapter (emulated Intel 21140 PCI Fast Ethernet), it has no issues with method 2.
.
Even though I can solve the problem by using method 1, it’s just a workaround. I don’t want to lose performance, so I’ve added a patch that detects the Hyper-V adapter and disables method 2. Any suggestions? I can post the code that copies the MDL and NET_BUFFER_LIST if anyone is interested.
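For the curious, one way to do that kind of detection is to query OID_GEN_VENDOR_DESCRIPTION at bind time and match the string; a sketch (the substring match and the per-binding flag are assumptions, not a documented contract):

    // Sketch: query the miniport's vendor description and flag Hyper-V adapters.
    CHAR desc[128] = { 0 };
    NDIS_OID_REQUEST req = { 0 };

    req.Header.Type     = NDIS_OBJECT_TYPE_OID_REQUEST;
    req.Header.Revision = NDIS_OID_REQUEST_REVISION_1;
    req.Header.Size     = NDIS_SIZEOF_OID_REQUEST_REVISION_1;
    req.RequestType     = NdisRequestQueryInformation;
    req.DATA.QUERY_INFORMATION.Oid = OID_GEN_VENDOR_DESCRIPTION;
    req.DATA.QUERY_INFORMATION.InformationBuffer       = desc;
    req.DATA.QUERY_INFORMATION.InformationBufferLength = sizeof(desc) - 1;

    NDIS_STATUS status = NdisOidRequest(hBinding, &req);
    // NDIS_STATUS_PENDING would finish in OidRequestComplete; synchronous
    // completion is assumed here only to keep the sketch short.
    if (status == NDIS_STATUS_SUCCESS && strstr(desc, "Hyper-V") != NULL) {
        pBinding->DisableMethod2 = TRUE;    // hypothetical per-binding flag
    }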

I have no idea about your problem with Hyper-V, but I am curious about your deduction regarding performance. Are you measuring with a specific application or tool? Is this TCP or UDP traffic? What kind of acceleration is active? Is it possible that the difference is not the result of adding a memory copy, but of a difference in when UM gets the chance to submit more data to send?

Our driver can be used to send any protocol because you have to build the network packet yourself. Typically we use raw Ethernet with no IP address, once again for speed. We also use IOCTLs (instead of IRP_MJ_WRITE and IRP_MJ_READ) for reading and writing because we can send/receive multiple packets with one operation. When we call NdisSendNetBufferLists(), the NET_BUFFER_LIST usually contains many packets. The only other acceleration we use is asynchronous IO (overlapped write operations). We also use synchronous IO, and the Hyper-V issue is present for both.
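On the user-mode side the send path is basically one overlapped DeviceIoControl per batch, something like this sketch (IOCTL_SEND_PACKETS, the batch layout, and hDriver, which is assumed to be opened with FILE_FLAG_OVERLAPPED, are placeholders):

    // User-mode sketch: submit a whole batch of raw frames in one overlapped ioctl.
    OVERLAPPED ov = { 0 };
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    if (!DeviceIoControl(hDriver, IOCTL_SEND_PACKETS,
                         batch, batchBytes,    // input: packed array of frames
                         NULL, 0,              // no output buffer
                         NULL, &ov)
        && GetLastError() == ERROR_IO_PENDING) {
        // Free to build and submit the next batch here; pick this one up later.
        DWORD bytes = 0;
        GetOverlappedResult(hDriver, &ov, &bytes, TRUE);
    }
    CloseHandle(ov.hEvent);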
.
As for measuring the performance, I’m not doing anything fancy. I have a test-send tool that allows me to send packets at a given rate. I can switch between method 1 and 2 with a simple call to the driver. From there I just look at the network performance in Task Manager. There always seems to be a performance increase for async IO, but not always for sync IO. Who knows what black magic happens after NdisSendNetBufferLists() is called.

Okay, so you have a custom protocol driver with a custom upper edge. You are testing it with your own test program and observing the result of the test via Task Manager. When the performance is different, it is different by enough that it is easily visible this way.

Based on this, I would guess that you have a problem with the threading model used. This guess is based on very tenuous evidence and could be entirely wrong, but here are some thoughts:

Using commodity hardware and the in-box UDP stack, it is possible to completely saturate a 10 Gb/s link with traffic generated from UM passing through all of the standard layers. I have not tried this on 40 Gb/s or 100 Gb/s links, or from a VM, but for you to be able to notice a difference in throughput, you must be operating at less than the wire speed.

Often (nearly always?) when async completions get ‘lost’, the root cause is a thread sync issue of some kind, often caused or observed because of differences in timing between typical events and extraordinary ones. This effect is amplified by all hypervisors, but Hyper-V may amplify it more than some others because of the sophistication of the interactions between the host and guest Windows kernels.

You haven’t told us how you complete the IOCTL, but after making a copy of the MDL, it would be possible to do that right away. Even if you don’t, the OVERLAPPED completion timing will change, possibly allowing your UM test program to submit the data to be sent more quickly.
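For example, something like this right after the copy (very rough; it assumes nothing about the send needs the user buffer or the IRP afterwards):

    // Rough sketch: once the payload is safely copied to kernel memory, the
    // ioctl could be completed immediately.
    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    // The send then proceeds using only the kernel copy; the completion handler
    // frees the copy but no longer touches an IRP.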

You mention that you can assign the rate of packet generation. That implies some scheme of timers or delays - both of which are subject to accuracy issues that are also amplified by hypervisors. It is also easiest to implement with a single sending thread. This also implies that at least some of the time, you are not attempting to make the IO go as fast as it possibly can, so performance measurements are going to be harder to interpret reliably.

These are all guesses, but my supposition is that you aren’t using an IOCP or the thread pool APIs, and that this causes the difference in performance that you see between the methods.

This discussion can go in several directions simply because there is so much happening from the point where the packet is created in user space to the point where it’s dispatched by NDIS. In my case, I’m seeing two anomalies that I can’t explain.

  • Copying a user-space MDL to a kernel-space MDL seems to perform better in a NET_BUFFER_LIST. You would think the act of making a copy would reduce performance, but it does not; the performance is equal or better.
  • Only Hyper-V network adapters have a problem sending a NET_BUFFER_LIST that contains a kernel-space MDL.

My observation of increased performance might be of interest to other NDIS developers. I’m more interested in understanding why a Hyper-V network adapter doesn’t like sending a NET_BUFFER_LIST that contains a kernel-space MDL. This causes a deadlock in the driver: for some reason the SendNetBufferListsCompleteHandler stops being called. This is where the IRP is completed, so it’s critical that the callback happens. I’ve noticed a few packets are sent before it stops. I don’t want to say it’s a bug in Hyper-V, because it’s more likely that I’m doing something wrong; however, I can’t explain why it only happens with Hyper-V adapters.
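For context, that path is essentially the following sketch (the IRP stashed in ProtocolReserved[0] mirrors the method-1 sketch above; FreeSendResources is a hypothetical helper and the real bookkeeping differs in detail):

    // Sketch of the send-complete path: free per-NBL resources and complete
    // the pending ioctl IRP.
    VOID
    ProtocolSendNetBufferListsComplete(
        NDIS_HANDLE ProtocolBindingContext,
        PNET_BUFFER_LIST NetBufferLists,
        ULONG SendCompleteFlags)
    {
        PNET_BUFFER_LIST nbl = NetBufferLists;

        while (nbl != NULL) {
            PNET_BUFFER_LIST next = NET_BUFFER_LIST_NEXT_NBL(nbl);
            PIRP irp = (PIRP)nbl->ProtocolReserved[0];

            // Release whatever backs the NBL: unlock/free the user MDL for
            // method 1, or free the kernel copy and its MDL for method 2.
            FreeSendResources(nbl);             // hypothetical helper
            NdisFreeNetBufferList(nbl);

            // The ioctl is finished here, which is why a lost completion
            // leaves the caller hanging forever.
            irp->IoStatus.Status = STATUS_SUCCESS;
            IoCompleteRequest(irp, IO_NO_INCREMENT);

            nbl = next;
        }

        UNREFERENCED_PARAMETER(ProtocolBindingContext);
        UNREFERENCED_PARAMETER(SendCompleteFlags);
    }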
.
Some other things you noticed correctly: I am not maxing out the wire speed (bandwidth). The point of our driver is not just throughput; it’s consistent performance. We are trying to achieve guaranteed latency (or something close). No, we are not using an IOCP or the thread pool APIs. We would like to leverage some Hyper-V enhancements in the future, such as Virtual Machine Queue.

I suggest that you reproduce the problem and then crash the OS and collect a dump. If there is a deadlock, it should be obvious. And the call stacks might provide a clue as to the cause.
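A few debugger commands that tend to be useful against such a dump (assuming the ndiskd extension that ships with recent WinDbg):

    !ndiskd.netadapter        - miniports and their state
    !ndiskd.pendingnbls       - NBLs that NDIS believes are still outstanding
    !ndiskd.nbl <address>     - details of one NET_BUFFER_LIST
    !stacks 2 ndis            - threads with NDIS frames on their stacks
    !irp <address>            - state of the ioctl IRP being waited on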

Memory corruption is also a possibility, but it is very hard to say. It might be useful to keep a trace buffer (circular array) of IRP pointers so you can locate that memory in the dumps.
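Something minimal is enough, for example (sketch; the size and names are arbitrary):

    // Minimal circular trace of recently submitted IRPs; easy to find in a
    // dump via the g_IrpTrace symbol.
    #define IRP_TRACE_DEPTH 256        // power of two

    typedef struct _IRP_TRACE {
        LONG Index;
        PIRP Entries[IRP_TRACE_DEPTH];
    } IRP_TRACE;

    IRP_TRACE g_IrpTrace;

    VOID TraceIrp(_In_ PIRP Irp)
    {
        LONG slot = InterlockedIncrement(&g_IrpTrace.Index);
        g_IrpTrace.Entries[slot & (IRP_TRACE_DEPTH - 1)] = Irp;
    }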

Task Manager seems an odd tool to use to measure performance when consistent latency is an important factor. It reports only very coarse metrics that are ‘contaminated’ by other system activity. Performance counters are probably more appropriate, but you do have to track the event/byte counts yourself and register them. For the first-time user, those APIs are a mess, but they boil down to giving Windows a pointer that it can read periodically to get a new sample, and telling it what it should read when it does. Obviously I can only guess at the specifics of what you need.

Regarding the performance itself, it does seem increasingly likely that it is related to thread scheduling and the IO pattern. As that’s not your focus, we can leave it at that.

Yes, I guess you’re right. I have to debug the issue by examining the call stacks, addresses, and memory. I was just hoping someone here had some additional tips or advice for using NDIS with Hyper-V adapters. After all, NDIS has additional features and options specifically for Hyper-V; it’s not like other network adapters in this way. As a developer I can’t help but think it could be one of those features that is misbehaving.
.
Thanks for the help. If I find a solution I’ll be sure to post it here.

Hi Rob,

Did you try to use NdisAllocateSharedMemory instead of NdisAllocateMemoryWithTagPriority?

Most probably, the memory allocated by the NdisAllocateSharedMemory function will be set up for correct DMA operation (meaning it will be aware of IOMMU presence), while memory from NdisAllocateMemoryWithTagPriority will not.
A hint from ReactOS code: https://doxygen.reactos.org/d1/d1d/drivers_2network_2ndis_2ndis_2memory_8c.html#a5b9f5f0d5489ca096544fc4415e31a6e

Best regards,
Yan.