You’re asking a bunch of questions here. I could answer most of them from
the perspective of the current Hyper-V release or the ideal virtualization
system. Practice never quite catches up with theory, and in this case,
we’ve got quite a long way to go. I’ll assume that you’re primarily curious
about the current Hyper-V release (Windows Server 2008 R2 - for the
archives) simply because that’s the only production-class virtualization
system today that uses Windows drivers in anything but the guest VM.
The Hyper-V VM Switch is a layer-2 switch. The physical NIC is essentially
the uplink port. Packets are sent to it only when the switch has already
decided, by one criterion or another, that those packets deserve to make it
onto the wire.
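
If it helps to have a mental model, the decision itself is the textbook
layer-2 one. Here is a toy sketch in plain C; the table, the sizes, and the
names are invented for illustration and have nothing to do with the actual
Hyper-V data structures:

#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define GUEST_PORTS  8
#define UPLINK_PORT  GUEST_PORTS   /* the physical NIC acts as the uplink port */
#define FDB_SIZE     64

/* One learned MAC-to-port mapping in the switch's forwarding table. */
struct fdb_entry {
    uint8_t mac[6];
    int     port;      /* 0..GUEST_PORTS-1 for guest VM NICs */
    int     valid;
};

static struct fdb_entry fdb[FDB_SIZE];

/* Textbook layer-2 decision: if the destination MAC belongs to another
 * guest, the frame stays inside the switch; otherwise (unknown unicast,
 * for simplicity) it goes out the uplink, i.e. the physical NIC. */
static int forward_port(const uint8_t dst[6])
{
    for (size_t i = 0; i < FDB_SIZE; i++) {
        if (fdb[i].valid && memcmp(fdb[i].mac, dst, 6) == 0)
            return fdb[i].port;
    }
    return UPLINK_PORT;
}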
We have experimented with various ways of managing fairness and flow control
between guest VMs and the physical port. Some were pretty high in overhead
and gave us performance characteristics that we didn’t like. Furthermore,
the current version of Hyper-V makes no attempt to strongly manage quality
of service and it’s not even possible to configure a VM NIC or a switch port
with a QoS statement. So we’ve found that simple mechanisms that generate
some approximation of fairness, but which are very cheap to implement, are
currently the best fit.
To wit, each VM NIC has a corresponding buffer in the guest VM. This
buffer is visible to the VM switch to which the VM NIC is attached. The
guest OS can queue up as many packets in that buffer (where larger packets
are essentially described by MDLs, not copied) as fit within it. The VM NIC
must stop queuing when the buffer is full, as there’s nowhere else to put
anything.
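
A rough sketch of that arrangement, again in plain C with invented names and
sizes (the real shared buffer layout and the copy-versus-describe cutoff are
internal and, as noted below, still being tuned):

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define RING_SLOTS   256
#define INLINE_MAX   256   /* pretend copy-vs-describe cutoff; the real one is a tuning knob */

/* One descriptor in the shared buffer.  Small packets are copied inline;
 * large ones are described in place, standing in for the MDL description
 * mentioned above. */
struct tx_desc {
    uint32_t    len;
    bool        by_reference;        /* true: payload stays in guest memory */
    const void *payload;             /* valid only when by_reference is set */
    uint8_t     inline_data[INLINE_MAX];
};

struct tx_ring {
    struct tx_desc slot[RING_SLOTS];
    uint32_t head;   /* advanced by the switch as it consumes descriptors */
    uint32_t tail;   /* advanced by the guest as it produces descriptors  */
};

/* Guest side: queue one packet, or report "full" so the guest's stack has
 * to hold it; there is nowhere else to put it. */
static bool guest_queue_packet(struct tx_ring *r, const void *pkt, uint32_t len)
{
    uint32_t next = (r->tail + 1) % RING_SLOTS;
    if (next == r->head)
        return false;                       /* buffer full: stop queuing */

    struct tx_desc *d = &r->slot[r->tail];
    d->len          = len;
    d->by_reference = (len > INLINE_MAX);
    if (d->by_reference)
        d->payload = pkt;                   /* describe, don't copy */
    else
        memcpy(d->inline_data, pkt, len);   /* small packet: just copy */

    r->tail = next;
    return true;
}

The part to notice is the failure return: when the buffer is full, the guest
simply has to stop queuing and hold the packet itself.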
The VM switch rotates among the VM NICs looking for work to do, pulling
large lists of packets off each one in turn. Control structures and
Ethernet headers are copied through the VM switch’s internal buffers, which
are finite. It’s entirely possible that it will drop packets in certain
conditions.
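
In outline, a service pass looks something like the following. This is a toy
model in plain C; the constants and counters are invented and stand in for
the real packet and buffer structures:

#define VNIC_COUNT   4
#define BATCH_SIZE   32    /* packets pulled from each port per pass         */
#define SWITCH_BUFS  128   /* finite pool of internal header/control buffers */

/* Minimal per-port state standing in for the guest's shared buffer. */
struct vnic_queue {
    int packets;           /* packets currently queued by this VM NIC */
};

static int  free_switch_bufs = SWITCH_BUFS;
static long forwarded, dropped;

/* One service pass: visit each VM NIC in turn and take up to a batch of
 * packets from it.  Each packet needs an internal buffer for its copied
 * headers and control data; when none are free, the packet is dropped
 * rather than blocking every other port. */
static void switch_service_pass(struct vnic_queue q[VNIC_COUNT])
{
    for (int p = 0; p < VNIC_COUNT; p++) {
        int n = q[p].packets < BATCH_SIZE ? q[p].packets : BATCH_SIZE;
        q[p].packets -= n;
        for (int i = 0; i < n; i++) {
            if (free_switch_bufs > 0) {
                free_switch_bufs--;   /* given back on send completion */
                forwarded++;
            } else {
                dropped++;            /* internal buffers exhausted */
            }
        }
    }
}

The round-robin visit order is the cheap approximation of fairness I
mentioned above.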
The VM switch indicates packets to the physical NIC using MDL chains that
describe headers held in the VM switch’s own buffers, and the higher-layer
protocol headers and data held in the VM’s shared buffer. (Again, whether
the higher-layer stuff is just copied or described by an MDL depends on how
big it is. The exact size is something
we will continue to tune.) These structures aren’t freed until the NIC
returns the net buffers back up the stack. (The VM can mess with any part
of it that it can see, which is why the layer-2 stuff must come from buffers
that the VM cannot see.)
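
As a sketch only (the real thing is built from net buffers and MDL chains,
not these invented structs), the shape of what gets handed to the miniport
is roughly:

#include <stdint.h>
#include <stddef.h>

/* One scatter/gather element: a contiguous region of the frame.  It is a
 * stand-in for one MDL in the chain handed to the physical NIC. */
struct sg_element {
    const void        *addr;
    uint32_t           len;
    struct sg_element *next;
};

/* One pending transmit as the miniport would see it.  The layer-2 header
 * lives in memory the guest cannot touch; the bulk of the packet is
 * described in place in the guest's shared buffer.  Nothing here may be
 * freed or reused until the send is completed back up the stack. */
struct pending_send {
    struct sg_element header_elem;     /* points into switch-owned memory */
    struct sg_element payload_elem;    /* points into guest memory        */
    uint8_t           header_copy[64]; /* switch-owned copy of the header */
    int               completed;       /* set when the NIC completes it   */
};

static void build_send(struct pending_send *s,
                       const void *hdr, uint32_t hdr_len,
                       const void *guest_payload, uint32_t payload_len)
{
    /* Copy the layer-2 header into switch-owned memory so the guest
     * cannot modify it after the forwarding decision has been made. */
    if (hdr_len > sizeof s->header_copy)
        hdr_len = sizeof s->header_copy;
    for (uint32_t i = 0; i < hdr_len; i++)
        s->header_copy[i] = ((const uint8_t *)hdr)[i];

    s->header_elem.addr  = s->header_copy;
    s->header_elem.len   = hdr_len;
    s->header_elem.next  = &s->payload_elem;

    s->payload_elem.addr = guest_payload;    /* described, not copied */
    s->payload_elem.len  = payload_len;
    s->payload_elem.next = NULL;

    s->completed = 0;
}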
I don’t think that there’s much you can do in your NICs and miniports to
help the VM switch beyond what Ali Dabagh has described at various WinHECs
and DDCs. Offloading performance counters is really nice. Header/Data
split is key. VMQ is very useful. I suspect that some of you will be
tempted to buffer up packets in software so that the VM switch won’t drop
packets as soon as it would otherwise. While you could certainly do this,
it would probably just tie up memory that only makes a difference for the
brief moment between when the VM switch’s buffers overflow and when yours
do.
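To put a rough number on it: at 10 Gb/s (just an example rate), an extra
megabyte of driver-side queue absorbs less than a millisecond of sustained
overload (8 Mb / 10 Gb/s = 0.8 ms), while adding that much delay to
everything queued behind it.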
Most of what you say below is true. Useful traffic shaping and flow
management is hard. Perhaps we’ll take that on in the next release.
–
Jake Oshins
Hyper-V I/O Architect
Windows Kernel Group
This post implies no warranties and confers no rights.
“Jan Bottorff” wrote in message
news:xxxxx@ntdev…
>> Calvin’s has the right approach, but I still say that most calls to
>> NdisSend are “lying sends” in reality.
>
> I agree with Calvin: driver developers should assume packet send completion
> from the hardware simply means the controller is done DMAing your buffer.
> There are LOTS and LOTS of places for packets to get lost before you get a
> packet received indication at the other end.
>
> A related question is: what does packet send completion for a virtual NIC in
> a VM actually mean?
>
> I’m really curious what happens on Hyper-V (and other virtual systems’)
> virtual NICs when the physical nic can’t transmit as fast as all the VMs
> trying to send packets. Do you get head of line blocking in the virtual
> switch on ALL the virtual nics for that switch? And QoS badly degrades? Or
> does the virtual switch start dropping packets on some virtual nics,
> attempting to prevent every virtual nic from blocking? Or can the virtual
> switch somehow throttle the send packet flow from virtual nics such that
> packets are never dropped and everybody gets fair access to the physical
> nic (which seems difficult since physical nics often just queue send
> packets, and the sender has no idea how full these queues are). Some nics
> (lots of nics?) queue packets in software when there is no space in the
> hardware queues, potentially allowing very large queue depths (which is very
> bad for latency, which causes the RTT to climb and TCP throughput to
> fall). I don’t actually know the tradeoffs on TCP throughput of dropping
> packets vs. huge queues. Some documentation on (desired) vmswitch behavior
> would be helpful so we understand the interaction between nic drivers and
> the virtual switch. Are there things drivers could do to better cooperate
> with vmswitch?
>
> Really, packet flow control needs to happen at the connection flow level,
> not the adapter packet level, which is a good argument in favor of TOE
> adapters (which, at least for TCP connections, can know the internal queue
> levels and apply round-robin or bandwidth limits per flow). TOE can also
> send TCP acks at higher priority and without the latency of interrupting
> the host,
> reducing the RTT. Efficiently dealing with network flow control and
> congestion is just a huge sticky area (and at the core of the whole net
> neutrality debate).
>
> Jan