disturbed UDP payload on VMs

Dear Gentlemen,

I'm facing a very strange problem that seems to be somehow related to a driver I've written.

Our setup is two (or more) virtual machines running on VMware (the OS is Windows Server 2012 R2). Each VM sends one multicast UDP packet per millisecond and receives the packets from the other VMs. What we're seeing is that about every 1-3 hours there are roughly 130 invalid packets on the receiving side, then it's OK again. Every receiver sees the same invalid packets, so the problem is probably on the sending side. And now it becomes interesting: I'm running Wireshark on the sender and the receivers. The Wireshark on the sender doesn't show the invalid packets; everything looks correct here, packets go out at one-millisecond intervals and they are exactly the ones I'm sending. On the receiver side those 130 packets arrive as a burst and the payload is neither what I've written nor what the Wireshark on the sender sees. So the payload seems to get disturbed somewhere "below" Wireshark.

For sending I'm using a driver I've written which uses Winsock Kernel to send the packets. My socket is a raw socket and I'm passing the complete packet, including the IPv4 header, to WskSendTo. For my test I'm writing identical values to the IPv4 Identification field and to the RTP sequence number in the UDP payload, and my Wireshark filter checks these bytes (the filter is: udp port 5004 and not (ip[4:2] == udp[10:2])). Using this filter the sender Wireshark doesn't see any problems, but all receivers do. Even more interesting: the IPv4 Identification on the receiving VMs increases continuously, just as I've written it, but the UDP payload shows gaps and repeated values until it settles down again after those roughly 130 packets, and then everything is fine again.
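To illustrate what the filter compares, here is a stripped-down sketch (not my actual driver code, the names are made up) of how the same 16-bit counter ends up in both fields before the buffer goes to WskSendTo; the offsets assume a 20-byte IPv4 header without options:

/* Stripped-down sketch, not the actual driver code: the same 16-bit
 * counter is written to the IPv4 Identification field and to the RTP
 * sequence number, which is exactly what the capture filter
 * "udp port 5004 and not (ip[4:2] == udp[10:2])" compares.
 * Offsets assume a 20-byte IPv4 header without options. */
#include <stdint.h>
#include <stddef.h>

#define IP_HDR_LEN   20u                    /* no IP options              */
#define UDP_HDR_LEN  8u
#define OFF_IP_ID    4u                     /* ip[4:2]                    */
#define OFF_RTP_SEQ  (IP_HDR_LEN + 10u)     /* udp[10:2] = RTP seq field  */

static void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);               /* network byte order         */
    p[1] = (uint8_t)(v & 0xFF);
}

/* Called once per packet before the buffer is queued for sending
 * (in the real driver this happens before WskSendTo). */
void stamp_test_counter(uint8_t *pkt, size_t len, uint16_t counter)
{
    if (len < IP_HDR_LEN + UDP_HDR_LEN + 4u)
        return;                             /* too short for an RTP header */

    put_be16(pkt + OFF_IP_ID,   counter);   /* IPv4 Identification        */
    put_be16(pkt + OFF_RTP_SEQ, counter);   /* RTP sequence number        */
    /* If the driver fills in the IP header checksum or the UDP checksum
     * itself, they need to be recomputed after these fields change. */
}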

I've tried to do a similar test without Winsock Kernel, just from user space, but I could not reproduce the problem there.
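Roughly, the user-space variant looks like this (a sketch of the kind of test I mean, not the actual program; group address, port and payload size are just placeholders):

/* Sketch only, not the actual test program: plain Winsock datagram socket
 * sending one small multicast packet per millisecond, with a 16-bit
 * counter at the RTP sequence-number offset (payload bytes 2-3).
 * Group address, port and payload size are placeholders. */
#include <winsock2.h>
#include <ws2tcpip.h>
#include <windows.h>
#include <stdint.h>
#include <string.h>
#pragma comment(lib, "ws2_32.lib")

int main(void)
{
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return 1;

    SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (s == INVALID_SOCKET)
        return 1;

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family      = AF_INET;
    dst.sin_port        = htons(5004);               /* port from the filter */
    dst.sin_addr.s_addr = inet_addr("239.1.2.3");    /* placeholder group    */

    uint8_t payload[188] = { 0 };                    /* placeholder size     */
    for (uint16_t seq = 0; ; seq++) {
        payload[2] = (uint8_t)(seq >> 8);            /* RTP sequence number  */
        payload[3] = (uint8_t)(seq & 0xFF);
        sendto(s, (const char *)payload, sizeof(payload), 0,
               (struct sockaddr *)&dst, (int)sizeof(dst));
        Sleep(1);                                    /* ~1 ms, coarse on VMs */
    }
    /* not reached */
}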

We've never seen such behavior on physical machines, and meanwhile there are many instances of this software and this driver running all over the world, so I'm sure we'd have heard about problems. But VMware support says they never touch the payload.

Do you have any impression whether this is more likely a VMware issue or a Windows thing? What other tests could be helpful, e.g. running the same test on W10 VMs, or running the test with unicast instead of multicast?

Thanks and best regards,
Johannes F,

Historically, it's never been a good idea to rely on a VM's sense of time, since the guest can be held off the CPU by the host. Modern implementations and platforms make this less likely, but it's still something to watch for.

Is it possible that this is just a vagary of the VM clock? Do you see the same number of packets sent as received (whether valid or invalid)?


Take a look at TCP/IP offloading settings.
I've had problems in the past where UDP packets arrived with a wrong checksum.
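If it helps, a captured frame's UDP checksum can be recomputed by hand with something like the following sketch (offsets assume IPv4; with checksum offload enabled, a sender-side capture often shows zero or "incorrect" checksums simply because the virtual NIC fills them in after the capture point):

/* Sketch only: recompute the UDP checksum of a captured IPv4 packet so a
 * suspect frame can be checked offline.  'ip' points at the start of the
 * IPv4 header; malformed packets are not handled beyond a length check. */
#include <stdint.h>
#include <stddef.h>

static uint32_t sum16(const uint8_t *p, size_t len, uint32_t acc)
{
    while (len > 1) { acc += ((uint32_t)p[0] << 8) | p[1]; p += 2; len -= 2; }
    if (len)          acc += (uint32_t)p[0] << 8;     /* odd trailing byte  */
    return acc;
}

/* Returns the checksum the UDP header should carry (0 is sent as 0xFFFF). */
uint16_t expected_udp_checksum(const uint8_t *ip)
{
    size_t         ihl     = (size_t)(ip[0] & 0x0F) * 4u;  /* header length */
    const uint8_t *udp     = ip + ihl;
    uint16_t       udp_len = (uint16_t)((udp[4] << 8) | udp[5]);

    if (udp_len < 8)
        return 0;                                     /* malformed          */

    /* pseudo-header: source + destination address, protocol, UDP length   */
    uint32_t acc = sum16(ip + 12, 8, 0);              /* src + dst address  */
    acc += 17;                                        /* protocol = UDP     */
    acc += udp_len;

    /* UDP header with the checksum field (bytes 6-7) treated as zero      */
    acc = sum16(udp, 6, acc);
    acc = sum16(udp + 8, (size_t)udp_len - 8, acc);   /* payload            */

    while (acc >> 16)
        acc = (acc & 0xFFFF) + (acc >> 16);

    uint16_t csum = (uint16_t)~acc;
    return csum ? csum : 0xFFFF;
}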

There are a variety of ways to debug network traffic from VMs.

  1. I'm almost positive VMware supports a monitor tap in the virtual switch. This works like monitoring on a physical Ethernet switch and sends a copy of the packets for a specific VM to some other VM running Wireshark.
  2. You could configure the VMware virtual switch to route packets between the two VMs via a physical link to a physical switch, which could then use monitor mode to spy on the packets going by. This would likely need two dedicated physical NICs on the VMware host, or perhaps some VLAN trickery to split a single physical NIC into multiple virtual NICs.
  3. You might try a different network sniffer; Microsoft has two, the older Network Monitor 3.4 and the newer Message Analyzer 1.4. Each has pros and cons; the older NM 3.4 seems to capture better and is fine for simpler traffic analysis, while MA 1.4 has a much better analyzer architecture for stream reassembly and a more sophisticated protocol definition language if you need to define your own protocols.

If VMware is corrupting packets, you will need to either pursue support from VMware, or figure out what stimulates the corruption and stop doing that. If you really need a packet reliably sent every millisecond, running in a VM is probably not going to give you an appropriate solution, as VMs share system resources with other VMs, and CPU scheduling happens.

Jan

On 3/29/17, 2:25 AM, “xxxxx@lists.osr.com on behalf of xxxxx@Freyberger.de” wrote:

Do you have any impression, whether this is probably more a VMware issue or a windows thing? What other tests could be helpful?

One question: which version of VMWare?

Also, I assume you know that UDP is inherently unreliable, so while this behaviour should not happen, it is in fact legal for the virtual switch to do this.

Assuming I understand you correctly (the problem only exists when you send the traffic from a KM socket, and not from an equivalent UM socket) and you can readily reproduce this in your lab, I would check which para-virtualization extensions are enabled and see what happens when you change those settings.


Dear gents,

thanks for all your answers.

Yes, we are fully aware of the unreliability of UDP and the timing issues of virtual machines.

I only mentioned the timing effect (which results in a packet burst of ~130) to get closer to the real problem, which is the disturbed payload (apart from that, I assume that once we've solved the payload problem, we won't see these extreme packet bursts anymore). I also think that UDP unreliability results in lost packets (which is what you very rarely see on physical machines), but not in a bunch of damaged packets.

So here are some more answers to your questions:

  • At the moment we're using VMware 6.0, but we're discussing switching to 6.5.

  • The total number of packets arriving at each listener is correct, and even the IPv4 Identification field shows no missing or doubled packets; it's only the payload that is wrong.

  • VMware support is already involved, and after a closer look at our setup and the Wireshark logs they accepted that it's probably something outside our software. But they are still discussing whether it could be something inside VMware or something between Wireshark and VMware, as they claim not to touch any payload. They have even installed a setup similar to ours and are trying to reproduce the problem there with our software. I'm not sure about the status of this test.

  • Our setup consists of two hosts, each running two VMs. We see the identical damaged packets on all three receivers, so it makes no difference whether it's a physical or a software switch.

  • Where do the network tools you mentioned (Network Monitor 3.4 and the newer Message Analyzer 1.4) capture the traffic compared to Wireshark?

Best regards,
Johannes

Packet damage inside a virtual switch should essentially be out of the question unless you have bad RAM in the host (which should result in other random errors and not manifest as a highly repeatable pattern of data corruption). If you are concerned about this possibility, you should run some tests after swapping DIMMs and varying the total number installed. If the corruption persists, and its pattern does not alter, you can safely assume that this is not the case and that there is a software problem of some kind.

Standard packet capture tools in Windows are implemented as NDIS protocol drivers. Pcap (used by Wireshark) and the Microsoft Network Monitor driver have nearly identical capabilities. This is where the higher levels in the network stack (IP etc.) hand data over to NIC drivers, so if you are seeing your expected data in the send path, and not in the receive path, at this level, then you can be sure that the corruption occurs in one of these locations:

  1. the virtual NIC driver in the sending VM
  2. the virtual bus / switch (or whatever they call it in VMware)
  3. the virtual NIC driver in the receiving VM

These three points are high level, and each one encompasses a complex set of interconnecting technologies, but the point is that if the packet stream is intact at the sending protocol edge and corrupt at the receiving protocol edge, then these are the only logical components in between.

This is why I asked you about para-virtualization settings and someone else asked about TCP offload settings. These are the settings, on both VMs, that control how the virtual NIC drivers work with respect to the real hardware and how functionality from above the protocol edge (checksums etc.) leaks into the virtual NIC.
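If you want to see the packets one layer below the capture point, one option is a trivial NDIS lightweight filter that re-checks your Identification/sequence invariant on the send path, just above the (virtual) miniport. A rough sketch follows (the context structure and the missing attach/detach boilerplate are assumptions; VLAN tags and IPv4 options are ignored):

/* Sketch only: an NDIS lightweight filter sits below the protocol drivers
 * that pcap-based capture uses, so checking the "IP Identification ==
 * RTP sequence" invariant here shows whether packets are still intact one
 * layer closer to the virtual NIC.  Attach/detach boilerplate, VLAN tags
 * and IPv4 options are omitted; the context structure is an assumption. */
#include <ndis.h>

typedef struct _FILTER_CONTEXT {
    NDIS_HANDLE FilterHandle;          /* handle from the attach routine     */
} FILTER_CONTEXT, *PFILTER_CONTEXT;

VOID
FilterSendNetBufferLists(
    NDIS_HANDLE      FilterModuleContext,
    PNET_BUFFER_LIST NetBufferLists,
    NDIS_PORT_NUMBER PortNumber,
    ULONG            SendFlags)
{
    PFILTER_CONTEXT ctx = (PFILTER_CONTEXT)FilterModuleContext;

    for (PNET_BUFFER_LIST nbl = NetBufferLists; nbl != NULL;
         nbl = NET_BUFFER_LIST_NEXT_NBL(nbl)) {
        for (PNET_BUFFER nb = NET_BUFFER_LIST_FIRST_NB(nbl); nb != NULL;
             nb = NET_BUFFER_NEXT_NB(nb)) {

            UCHAR  storage[46];        /* Ethernet + IPv4 + UDP + 4 RTP bytes */
            PUCHAR p = (PUCHAR)NdisGetDataBuffer(nb, sizeof(storage),
                                                 storage, 1, 0);
            if (p == NULL)
                continue;              /* frame shorter than we need         */

            /* only look at untagged IPv4/UDP frames to port 5004 */
            if (p[12] != 0x08 || p[13] != 0x00 || p[14 + 9] != 17 ||
                ((p[34 + 2] << 8) | p[34 + 3]) != 5004)
                continue;

            /* offsets assume a 20-byte IPv4 header without options */
            USHORT ipId   = (USHORT)((p[14 + 4]  << 8) | p[14 + 5]);
            USHORT rtpSeq = (USHORT)((p[14 + 30] << 8) | p[14 + 31]);

            if (ipId != rtpSeq) {
                DbgPrint("mismatch below the capture point: id=%u seq=%u\n",
                         ipId, rtpSeq);
            }
        }
    }

    /* always pass the packets down unchanged */
    NdisFSendNetBufferLists(ctx->FilterHandle, NetBufferLists,
                            PortNumber, SendFlags);
}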


It's not only one machine that sends these disturbed packets. We can see this random problem in all the streams from every machine (2 hosts, each hosting 2 VMs), so I think we can quite safely rule out bad RAM. Furthermore, we've already tried different NICs (E1000 and vmxnet3) and different switches (vSphere Standard Switch and vSphere Distributed Switch), but that didn't solve the problem. It even shows up inside the same host. We have not checked the para-virtualization and TCP offload options yet, as that contradicts our idea of virtualization a little bit, and the offload options can have a big impact on CPU usage, since we eventually want to do realtime streaming, sending several packets per millisecond.

Best,
Johannes