Hello,
a few months ago I wrote a driver for sending UDP packets at high rates with low jitter.
(see http://www.osronline.com/showThread.cfm?link=250081)
Meanwhile this driver works quite well, and we have now done some tests on virtual machines (VMware vSphere® 5.1 vMotion Architecture). There I see completely different timing, and sometimes the UDP packets are delayed by up to 12 milliseconds (vs. 0.5 milliseconds on physical machines). Can I do something inside my driver to improve this, or can the only improvement be made in a special driver for the hosting machine, i.e. outside the virtual machine and outside my driver?
Thanks and best regards,
Johannes F.
You realize that a VM can be preempted, if the VM host is oversubscribed?
Yes, I’m aware of preemption, and I don’t expect the virtual machine to ever behave completely like a physical machine. But 12 milliseconds is a lot, so I thought about ways to improve this; I’m just not sure whether this is possible inside my driver at all, or whether everything depends on the VM host. If the complete VM (including all its drivers) is just a normal user-mode process on the VM host, I will probably not be able to solve this inside my driver … and then it’s not a driver question, but a VM host question …
I have recently been working on virtual drivers that run in a virtual machine
(Linux QEMU). Let me offer an alternative way of doing this, though it
probably won’t solve your problem, given that the guest CPU bandwidth is
limited by the hypervisor.
- On receiving the user-mode UDP buffer, lock down the user buffer and create an
MDL for it (a minimal sketch of this step follows the list).
- Instead of handing the UDP buffer to kernel Winsock, send the
scattered guest-physical buffers (associated with the user UDP buffer) to a
virtual socket device (described later).
- When the virtual socket device completes the send() operation, it signals the
guest that the buffer is done.
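The first step in kernel code might look roughly like this; a minimal WDM-style sketch, assuming the send request arrives with a raw user-mode pointer and length (the function name LockUserUdpBuffer and its parameters are made up for illustration; a real driver would take them from the IRP/IOCTL carrying the send request):

/*
 * Sketch: describe a user-mode UDP buffer with an MDL and pin its pages.
 */
#include <ntddk.h>

NTSTATUS LockUserUdpBuffer(_In_ PVOID UserBuffer,
                           _In_ ULONG Length,
                           _Out_ PMDL *MdlOut)
{
    PMDL mdl;

    *MdlOut = NULL;

    /* Describe the user buffer with an MDL (no IRP association here). */
    mdl = IoAllocateMdl(UserBuffer, Length, FALSE, FALSE, NULL);
    if (mdl == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    __try {
        /* Pin the pages; read access is enough for a send. */
        MmProbeAndLockPages(mdl, UserMode, IoReadAccess);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(mdl);
        return GetExceptionCode();
    }

    /* The MDL now lists the guest-physical pages backing the buffer;
     * these are the segments handed to the virtual socket device. */
    *MdlOut = mdl;
    return STATUS_SUCCESS;
}

The pages stay locked until the virtual device reports completion, at which point MmUnlockPages() and IoFreeMdl() undo the above.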
The reason to write your own “virtual socket” device is that the kernel
Winsock driver eventually uses the underlying virtual network device. It could
be that the virtual network device is not fast enough. In my experience
with Linux QEMU, QEMU emulates a PCI rtl8139 or Intel e1000. As you
know, there is overhead when using those emulated network devices. People
can tweak their QEMU network settings (e.g. by using a virtio network device) to
achieve better performance. I have no knowledge of how those virtual network
devices perform. In general, a para-virtualized network device has better
performance than an emulated PCI network device. I don’t know how the virtual
network device performs on the VMware® vSphere 5.1 vMotion Architecture. I
guess it is some kind of para-virtualized network device that talks to a network
driver on the host side.
Since your application operates at layer 4 (above the IP layer), it is
obvious that encapsulating the UDP/TCP packet all the way down to a layer-2 (802.3
MAC) packet can cause a certain performance penalty. It is intuitively
more efficient if you can pass the guest user-space UDP buffer directly to the
host-side socket interface. To do so, you need to create a “virtual socket device”,
if that is feasible. In QEMU you can cook up your own device and expose it
to the guest without a problem. I am not sure whether it is feasible to do so in the
VMware hypervisor. Assuming it is, the following describes the
virtual hardware interface:
virtual socket device interface:
- It is a PCI or PCIe device. It provides two BARs: BAR0 is for MMIO, BAR1
is for the scatter-gather linked list.
- It is your decision what the scatter-gather linked list looks like. A
guest UDP buffer may be physically broken up into several guest-physical
segments. You put those guest-physical segments into the
scatter-gather list, and signal to the hardware that the list has been
updated by writing an MMIO register.
- The hardware, upon receiving the MMIO write, translates those guest-physical
segments into host virtual buffers. Then you use a scatter-gather I/O
primitive: on a Windows host you could use WSASend(), on a Linux host
writev(), to do a scatter-gather send of the guest-physical
segments associated with the guest user-space UDP buffer.
- When the scatter-gather send is complete, the device signals the guest OS by firing an
interrupt. The guest virtual socket device driver checks the scatter-gather
list, finds which UDP user buffer is done, and completes that UDP request
with IoCompleteRequest(). (A sketch of one possible descriptor layout and the
host-side send path follows this list.)
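To make that concrete, here is one possible shape for the BAR1 descriptors and the matching host-side send on a Linux host. Everything here (vsock_sg_entry, vsock_sg_ring, the gpa_to_hva() helper) is invented for this sketch; the thread does not fix a descriptor format, and each hypervisor has its own guest-memory mapping API.

#include <stdint.h>
#include <sys/uio.h>                 /* writev() */

#define VSOCK_SG_MAX_SEGS 64

struct vsock_sg_entry {              /* one guest-physical segment */
    uint64_t guest_phys_addr;
    uint32_t length;
    uint32_t flags;                  /* bit 0 = last segment of this UDP buffer */
};

struct vsock_sg_ring {               /* shared with the device via BAR1 */
    uint32_t producer;               /* advanced by the guest driver */
    uint32_t consumer;               /* advanced by the device model */
    struct vsock_sg_entry entries[VSOCK_SG_MAX_SEGS];
};

/* Hypothetical hypervisor helper: map a guest-physical range to a
 * host-virtual pointer (QEMU has its own APIs for this). */
extern void *gpa_to_hva(uint64_t gpa, uint32_t len);

/* Device-model side, called when the guest writes the "doorbell" MMIO
 * register: gather the segments of one UDP buffer and send them with a
 * single writev() on an already-connected UDP socket, so the payload is
 * never copied on the host. */
ssize_t vsock_send_one_buffer(int udp_fd, struct vsock_sg_ring *ring)
{
    struct iovec iov[VSOCK_SG_MAX_SEGS];
    int niov = 0;

    while (ring->consumer != ring->producer && niov < VSOCK_SG_MAX_SEGS) {
        struct vsock_sg_entry *e =
            &ring->entries[ring->consumer % VSOCK_SG_MAX_SEGS];

        iov[niov].iov_base = gpa_to_hva(e->guest_phys_addr, e->length);
        iov[niov].iov_len  = e->length;
        niov++;
        ring->consumer++;

        if (e->flags & 1)            /* end of this UDP buffer */
            break;
    }

    return writev(udp_fd, iov, niov);   /* one datagram, scatter-gather */
}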
Using the above mechanism, you guarantee that no memory copy is
required to move the guest user buffer into guest kernel space. On the host side,
you could configure your socket to omit “buffering” (a memory copy) when
sending, by using setsockopt() (see the sketch below). You also avoid the overhead of layer-4-to-layer-2
encapsulation/decapsulation.
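For the setsockopt() part: on a Windows host the usual way to ask Winsock to send directly from the caller’s buffers is to set SO_SNDBUF to 0. A minimal sketch; whether it actually reduces latency depends on the host stack and NIC driver:

#include <winsock2.h>

static int disable_send_buffering(SOCKET s)
{
    int zero = 0;
    /* SO_SNDBUF = 0: no Winsock send buffer, WSASend() works from the
     * supplied buffers until the send completes. */
    return setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                      (const char *)&zero, sizeof(zero));
}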
The only limitation comes from hypervisor scheduling.
Good luck.
Hi JiaBang,
thanks for your detailed answer. Yes, it’s probably a very good idea to pass all the data to the host system and do the real UDP sending there. Perhaps I don’t even need a kernel driver on the VM client side, since scheduling latencies of up to 12 milliseconds will, I assume, be the same for user-mode processes on the VM client? So if I just use normal sockets for transferring data in bigger blocks between VM client and VM host (which I already do now between user mode and kernel mode), this should not be the bottleneck?
But it could be difficult to find the precise time for sending the UDP packets on the VM host side, as the timestamps provided in the big block are based on the frequency counter of the VM client. So I’d need to pass the frequency counter of the VM host to the VM client, or at least know the constant offset (assuming that they’re running at the same frequency).
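One common way to estimate that constant offset is a simple round trip: the client reads its counter before (t0) and after (t1) asking the host for its counter value (th), and assumes th corresponds to the midpoint. A rough sketch; guest_counter_us() and ask_host_counter_us() are placeholders for whatever counter and guest/host transport is actually used:

#include <stdint.h>

extern int64_t guest_counter_us(void);     /* e.g. performance counter scaled to microseconds */
extern int64_t ask_host_counter_us(void);  /* round trip to the host side */

int64_t estimate_host_guest_offset_us(void)
{
    int64_t t0 = guest_counter_us();
    int64_t th = ask_host_counter_us();
    int64_t t1 = guest_counter_us();

    /* host_time - guest_time; only meaningful while both counters run at
     * the same rate, as assumed above. */
    return th - (t0 + t1) / 2;
}

Repeating this a few times and keeping the sample with the smallest (t1 - t0) reduces the error caused by scheduling jitter in the round trip.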
Johannes
>Meanwhile this driver works quite well, and we have now done some tests on virtual machines (VMware vSphere® 5.1 vMotion Architecture).
Try another hypervisor. Will the result be different?
–
Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com
Scheduler granularity (quantum, aka timeslice) only comes into play under 100% CPU load, when a single thread runs for all these 12 ms without blocking on some event.
KeSetEvent-style calls do not incur this delay (unless a higher-priority thread is running on the same core).
So this is some implementation artifact of VMware’s hypervisor, which cannot be dealt with from the guest.
“Marion Bond” wrote in message news:xxxxx@ntdev…
I assume you realize that your 12 ms delay is consistent with the scheduler granularity of the host OS. It should be obvious that, because the guest OS only gets CPU time when the host OS scheduler allows it, there is no way to achieve any better consistency than the host OS and hypervisor allow.
Sent from Surface Pro