Hyper-V inter-partition communication question

anton_bassov · December 6, 2018, 6:23am

I would like to ask a question concerning the communications between two child partitions under Hyper-V.

Let’s consider the following scenario. Let’s say we’ve got child partitions A and B. The former is a Windows guest, and the latter is a custom non-MSFT OS that is paravirtualised specifically for running under Hyper-V. These two have to pass fairly large amounts of data to one another. The amount of data passed between them and the frequency of these “exchanges” are comparable to those between an SMB (or NFS) client and server on a corporate network. Therefore, efficiency is an absolute must here.

Since these two guests physically reside on the same machine, there must be some room for optimisation. For example, we can significantly reduce the overhead that is introduced by the network stack. The very first thing that comes to mind in this situation is designing a custom non-routable Layer 3 protocol (i.e., writing a custom NDIS protocol driver and exclusively binding it to a system-provided virtual NIC that is specifically reserved for this purpose), effectively eliminating the overhead of TCPIP.SYS, as well as PSCHED.SYS and friends. However, I wonder if it is possible to go further than that, and to avoid the overhead that is introduced by the NDIS layer as well.

Therefore, my questions are

1. Is it possible to design a custom Vmbus Windows driver? This driver has to be a fully “supported” one - no hacking, no hooking, and, generally speaking, impeccable enough to qualify even for Mr. Burn’s approval, let alone for MSFT certification.

2. If the answer to the above question is positive, is the potential performance enhancement worth the trouble of taking this path? Certainly, an optimisation like a shared buffer would come in handy in this situation, but, judging from the Hyper-V functional specification, it does not seem to allow passing pages between the guests in XEN-like fashion. Therefore, I am not sure whether it is worth the whole trouble.

In general, what is the best way to go in the above mentioned situation?

Anton Bassov

Tim_Roberts · December 6, 2018, 6:02pm

anton_bassov wrote:

Let’s consider the following scenario. Let’s say we’ve got child partitions A and B. The former is a Windows guest, and the latter is a custom non-MSFT OS that is paravirtualised specifically for running under Hyper-V. These two have to pass fairly large amounts of data to one another. The amount of data passed between them and the frequency of these “exchanges” are comparable to those between an SMB (or NFS) client and server on a corporate network. Therefore, efficiency is an absolute must here.

Since these two guests physically reside on the same machine, there must be some room for optimisation. For example, we can significantly reduce the overhead that is introduced by the network stack. The very first thing that comes to mind in this situation is designing a custom non-routable Layer 3 protocol (i.e., writing a custom NDIS protocol driver and exclusively binding it to a system-provided virtual NIC that is specifically reserved for this purpose), effectively eliminating the overhead of TCPIP.SYS, as well as PSCHED.SYS and friends. However, I wonder if it is possible to go further than that, and to avoid the overhead that is introduced by the NDIS layer as well.

I freely admit that I’m stepping outside of my comfortable knowledge
bubble here, but it seems to me that the overhead of the TCP/IP and NDIS
stacks for virtual LAN communication between two VMs must be absolutely
trivial compared to the overhead of context switching between the two
VMs to process the data. I would have guessed the network driver
overhead was mere noise in comparison. Do you have evidence this is not
the case?

Jeffrey_Tippet_MSFT · December 6, 2018, 8:27pm

You have a few good options:

A. You do have raw access to vmbus in Windows [1]. So you can build your own synthetic device in Windows, layering over Windows’ native vmbus.sys driver. In your custom OS, you’ll have to write your own implementation of the vmbus contract [2], which is not an intern project, but it’s still doable. Depending on licenses, etc., you may also be able to derive inspiration from Linux’s or BSD’s implementations of vmbus.

B. You also have access to something called a Hyper-V socket [3]. This is essentially a cross-VM pipe that is exposed through the Windows sockets API. Other than winsock, it doesn’t use the networking stack at all. Which is good, because that means performance is better, and you don’t have to worry about addressing and firewalls, and all the other drama of the network stack. Again, for your custom OS, you have to build things up from scratch, but again Linux and BSD have examples for you to eyeball.

C. As you’ve noted, you can just bet on netvsc + the IP stack. The advantage here is that there’s basically no work to do on Windows. And your custom OS probably will need a working network stack eventually, so you probably will be glad to have built this anyway. You can get an idea of the throughput + latency by benchmarking two Windows guests, back-to-back. ~25Gbps should be a good ballpark.

Personally, I’ve written stuff on top of [1], and found it to be both high-performance and easy to program against. I’ve not used [3] yet, so I can’t tell you about its performance characteristics. I’d imagine that the bigger concern here is not which APIs are available in Windows, but how quickly you can bootstrap a paravirtualization stack in your custom OS.

Using vmbus, you can easily create pages that are shared between partitions. For example, a guest partition can call VmbChannelCreateGpadlFromMdl, which gives you a handle that the root partition can map into its address space. As a convenience, vmbus can combine that with the operation of sending a vmbus packet, so the page(s) are automatically mapped + unmapped as the packet is delivered + acked: VmbPacketSendWithExternalMdl. Refer to the docs to see all the various useful variations on this.

Once you can share pages between partitions, you can pretty much build any sort of super high-performance ring buffer. But vmbus already aims to be that, so it’ll be a challenge to meaningfully beat vmbus’s performance. So while you certainly can just use vmbus to negotiate initialization and set up a shared GPADL for your homemade lock-free ring buffer, I’d suggest not building that from scratch, and just using vmbus natively.
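To give a feel for what such a homemade ring would involve, here is a minimal sketch of a single-producer/single-consumer byte ring that could live in pages shared between two partitions. It is not the vmbus ring format. The names `spsc_ring`, `ring_init`, `ring_write`, `ring_read` and the 4096-byte capacity are all hypothetical, and the sketch assumes C11 atomics, exactly one writer and one reader, and that both sides map the same memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout; NOT the vmbus ring format. */
#define RING_CAPACITY 4096u
static_assert((RING_CAPACITY & (RING_CAPACITY - 1)) == 0,
              "capacity must be a power of two");

struct spsc_ring {
    _Atomic uint32_t write_index; /* free-running, advanced only by the producer */
    _Atomic uint32_t read_index;  /* free-running, advanced only by the consumer */
    uint8_t data[RING_CAPACITY];
};

void ring_init(struct spsc_ring *r)
{
    atomic_init(&r->write_index, 0);
    atomic_init(&r->read_index, 0);
}

/* Producer side: returns 1 if all len bytes were queued, 0 if there is no room. */
int ring_write(struct spsc_ring *r, const void *src, uint32_t len)
{
    uint32_t w = atomic_load_explicit(&r->write_index, memory_order_relaxed);
    uint32_t rd = atomic_load_explicit(&r->read_index, memory_order_acquire);
    if (len > RING_CAPACITY - (w - rd))
        return 0;
    uint32_t off = w & (RING_CAPACITY - 1);
    uint32_t first = RING_CAPACITY - off;   /* bytes before the wrap point */
    if (first > len)
        first = len;
    memcpy(r->data + off, src, first);
    memcpy(r->data, (const uint8_t *)src + first, len - first);
    /* Release: the payload becomes visible before the new write index. */
    atomic_store_explicit(&r->write_index, w + len, memory_order_release);
    return 1;
}

/* Consumer side: returns 1 if len bytes were dequeued, 0 if fewer are available. */
int ring_read(struct spsc_ring *r, void *dst, uint32_t len)
{
    uint32_t rd = atomic_load_explicit(&r->read_index, memory_order_relaxed);
    uint32_t w = atomic_load_explicit(&r->write_index, memory_order_acquire);
    if (len > w - rd)
        return 0;
    uint32_t off = rd & (RING_CAPACITY - 1);
    uint32_t first = RING_CAPACITY - off;
    if (first > len)
        first = len;
    memcpy(dst, r->data + off, first);
    memcpy((uint8_t *)dst + first, r->data, len - first);
    /* Release: the space is handed back only after the copy-out is done. */
    atomic_store_explicit(&r->read_index, rd + len, memory_order_release);
    return 1;
}
```

Even this toy version leaves out everything that makes a real ring hard: signalling the other side when data arrives, message framing, and treating the peer's indices as untrusted input. That is a large part of why using vmbus natively is the less painful route.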

As to whether the perf is “worth it”… that really depends on your requirements, your workload, etc. Raw vmbus can definitely beat the performance & reliability of using an IP stack. (This is trivially true, since our paravirtualized network stack is built on top of vmbus.)

[1] https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/vmbuskernelmodeclientlibapi/ More importantly, read the giant comment at the top of vmbuskernelmodeclientlibapi.h, which itself is great documentation.

[2] https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs See chapter 11, “Inter-partition communication”.

[3] https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/user-guide/make-integration-service

Peter_Viscarola_OSR · December 6, 2018, 8:49pm

Isn’t there some “super special stuff” that the VMBus network adapter driver does to optimize guest to guest throughput?

And, don’t you really want dedicated hardware to handle this? Like, network adapters with VMQs (if you haven’t yet stumbled across VMQs, check out this Reddit post).

Peter

Jeffrey_Tippet_MSFT · December 6, 2018, 9:43pm

Isn’t there some “super special stuff” that the VMBus network adapter driver does to optimize guest to guest throughput?

The vmbus-based network adapter does have a bit of special sauce: it uses “transfer pages” to implement the synthetic receive path. Transfer pages are not exposed in the Windows API of vmbus (although the Linux & BSD implementations of vmbus expose them, I believe). However, I wouldn’t describe this as “super special stuff … to optimize guest to guest throughput”, as much as it is just a legacy mode of vmbus that has since been superseded by better GPADL support. netvsc only continues to use transfer pages because there hasn’t been a sufficiently compelling reason to break compatibility with previous hypervisors and guests. Using fully-supported GPADLs, you should be able to get the same performance as netvsc sees from using transfer pages.

I know this because, as it happens, about a year ago I tried the exercise of reimplementing netvsc on top of the public DDK. And I stumbled on transfer pages and the more minor omission of VmbChannelPacketGetClientContext. The latter API I successfully lobbied to get added to the DDK, so it’s no longer a problem.

One “super special” feature that the Hyper-V virtual switch does not have is the ability to directly share pages between partitions for transferring bulk data. It cannot do this, because the root partition needs to enforce various security policies (MAC spoofing, etc). Your setup might well not need such tight control over security, in which case, you can go wild.

anton_bassov · December 6, 2018, 10:04pm

I freely admit that I’m stepping outside of my comfortable knowledge bubble

Well, you would not expect a “boring” question from me, would you…

it seems to me that the overhead of the TCP/IP and NDIS stacks for virtual LAN communication between two VMs
must be absolutely trivial compared to the overhead of context switching between the two VMs to process the data.
I would have guessed the network driver overhead was mere noise in comparison.

Well, there are 2 points to consider here.

First, you are not going to switch the context between the VMs straight away, are you? According to the Hyper-V specification, the whole thing works on an asynchronous basis - you just post a message or an event (the latter is the lighter-weight approach) to the target port, and go about your own business. The recipient is informed about it by means of a synthetic interrupt when the target VCPU gets scheduled.
The “only” question is the amount of data that you are able to pass between two guests this way. If you could only pass a PFN between two guests with a message or an event, the whole thing would be a truly lightweight one. More on that below.

Concerning the overhead that communication via the virtual NIC introduces, please note that it is not necessarily limited to the work done in the guests. Apparently, Hyper-V and the parent partition have to do quite a bit of work behind the scenes if you use virtual NICs for communication. Now add the overhead of processing the packets, calculating checksums, “socket buffer ↔ virtual NIC buffers” copy operations, and all other network-related work that has to be performed by the guests. The most interesting part is that, once both VMs physically reside on the same machine, all this work is totally unnecessary.

Now let’s compare it to passing PFNs between the guests. According to the Hyper-V functional specification, there are System physical addresses (SPAs) and Guest physical addresses (GPAs). The former relate to the actual physical RAM, and the latter define the guest’s view of physical memory. Furthermore, it says that mapping the latter to the former is possible, although it does not say anything about the reverse. I am going to thoroughly examine the list of all hypercalls that are available to unprivileged guests in order to see if SPA<->GPA mappings can be established by them. If this is the case, then we are in a position to develop a truly lightweight communication method between the guests.
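As a rough illustration of what “passing PFNs” amounts to on the sending side, the sketch below turns a guest-physical byte range into the list of 4 KiB guest page frame numbers that cover it. It assumes 4 KiB pages and a buffer that is contiguous in guest-physical space. The names `GUEST_PAGE_SHIFT`, `gpa_range_page_count` and `gpa_range_to_pfns` are made up for this example and do not come from the Hyper-V specification or any driver kit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB guest pages. */
#define GUEST_PAGE_SHIFT 12u

/* Number of pages touched by the byte range [gpa, gpa + len). */
size_t gpa_range_page_count(uint64_t gpa, uint64_t len)
{
    if (len == 0)
        return 0;
    uint64_t first = gpa >> GUEST_PAGE_SHIFT;
    uint64_t last = (gpa + len - 1) >> GUEST_PAGE_SHIFT;
    return (size_t)(last - first + 1);
}

/*
 * Fill pfns[] with the guest page frame numbers covering [gpa, gpa + len).
 * Returns the number of entries written, or 0 if the range is empty or
 * needs more than max entries.
 */
size_t gpa_range_to_pfns(uint64_t gpa, uint64_t len, uint64_t *pfns, size_t max)
{
    assert(pfns != NULL || max == 0);
    size_t n = gpa_range_page_count(gpa, len);
    if (n == 0 || n > max)
        return 0;
    uint64_t first = gpa >> GUEST_PAGE_SHIFT;
    for (size_t i = 0; i < n; i++)
        pfns[i] = first + i;
    return n;
}
```

A range that starts 16 bytes before a page boundary and is 32 bytes long straddles two pages, so it yields two PFNs even though it is tiny. The receiver also needs the offset into the first page and the byte count to reconstruct the buffer.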

However, at this point one more potential issue arises. When it comes to Windows, the availability of XYZ functionality to the system does not necessarily imply that third-party drivers are allowed to make use of it as well. Therefore, the problem is not yet solved…

Anton Bassov

Jeffrey_Tippet_MSFT · December 6, 2018, 10:15pm

Now add the overhead of processing the packets, calculating checksums, “socket buffer ↔ virtual NIC buffers” copy operations, and all other network-related work that has to be performed by the guests.

I completely agree with your basic point: running everything through an ethernet + IP network stack is good for compat with unaware apps, but leaves some perf on the table. I need to make one small correction in defense of networking, though. We did at least figure out how to eliminate unnecessary checksum computations: in the guest-to-guest or root-to-guest cases, the virtual switch repurposes NDIS’s checksum offload feature to eliminate any checksum computation. (Since the transport is no less reliable than the OS itself, checksums are redundant.)
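For a sense of the per-byte work that gets skipped, here is a minimal sketch of the standard ones’-complement Internet checksum (RFC 1071) used by IP, TCP and UDP. It is a plain reference version written for this thread, not code from any Windows component, and the function name `internet_checksum` is just illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over len bytes, treating data as big-endian 16-bit words. */
uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;

    /* Sum the buffer as 16-bit big-endian words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    /* An odd trailing byte is padded with a zero low byte. */
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Every payload byte has to be touched once to compute this, which is exactly the kind of cost that is pointless when the data never leaves the machine.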

However, at this point one more potential issue arises. When it comes to Windows, the availability of XYZ functionality to the system does not necessarily imply that third-party drivers are allowed to make use of it as well. Therefore, the problem is not yet solved…

Perhaps you haven’t read my earlier reply yet? I claim you actually have some pretty good options for fully-supported cross-partition features from within Windows.

anton_bassov · December 6, 2018, 11:16pm

Thank you so much, Jeffrey - you cannot even imagine how glad I am to find out that the whole thing is perfectly feasible and well-supported!!! I was just typing my reply to Tim, so I had no chance to see your post, which obviously (at least partly) invalidates my previous one.

The option (1) seems (at least at this point) to be the optimal one. The ultimate advantage of the option (2) is that it saves you from all the trouble of writing a VMBUS driver for Windows. OTOH, we have to keep in mind the custom OS’s side of the project as well, and in this respect the option (1) seems to be preferable. At least the Linux code that is available at https://github.com/LIS/lis-next, as well as the Linux kernel tree, seem to be offering quite good reference designs. Taking into consideration that (if I got you right, of course) the option (1) offers the best performance anyway, it seems to be the right way to go

Thank you so much

Anton Bassov

Jeffrey_Tippet_MSFT · December 7, 2018, 6:55pm

Taking into consideration that (if I got you right, of course) the option (1) offers the best performance anyway, it seems to be the right way to go

Yes, you got me right. I think using vmbus is likely the best option for your purpose.

Note that Linus’s tree (not the LIS repo you linked to) has the latest vmbus source code.

anton_bassov · December 8, 2018, 3:04pm

Note that Linus’s tree (not the LIS repo you linked to) has the latest vmbus source code.

Thank you so much, Jeffrey…

I’ve got one more question. A brief look at vmbus_drv.c in the Linux kernel tree immediately reveals that the vmbus driver parses ACPI resources, which means VMBUS is an ACPI device. This, in turn, leads me to the logical conclusion that, in order to expose a custom synthetic VMBUS device to the custom OS that runs in a child partition, one needs to write some Windows service that runs in the parent partition.

Did I get it right? If yes, could you please advise me what exactly has to be developed for the parent partition?

Thanks in advance

Anton Bassov

Jeffrey_Tippet_MSFT · December 12, 2018, 1:07am

vmbus itself is enumerated by ACPI. However, child nodes of vmbus do not need a presence in ACPI. That is, your guest OS is supposed to parse ACPI enough to figure out where vmbus is (i.e., what memory ranges are assigned to it). But once you have that working, it’ll be enough to drive any number of vmbus channels/devices.

One caveat though: vmbus’s primary purpose is to be the pipe for paravirtualized devices. So it’s primarily organized as a client/server model, with the root partition (aka “management OS”) as the server, and any arbitrary guest VM as the client. If you want to have two guest VMs communicate, you’ll probably have to set up some sort of driver in the root that opens channels between the two guest VMs.

I’ve asked internally how efficient you can make that. Ideally there would be a way to avoid running CPU cycles in the host partition, just to memcpy payload from one GPADL to another. I’ll let you know if I hear back on a better way.

anton_bassov · December 12, 2018, 2:23am

Thank you so much, Jeffrey…