NDISTest reporting out of order packets

Hey everyone,

I’m trying to nail down an out of order issue in my NDIS 6.2 driver’s rather simple send and receive path, and this is probably highlighting a fundamental flaw in my understanding. I found this excellent archived thread and I believe I’m fulfilling the requirements listed by Jeffrey, but the problem remains.

To set the stage, in the simple case I have a 128 core system, 16 transmit queues and 32 receive queues (2 per MSI-X entry). The receive path is easy since it uses RSS, so there’s no real way to reorder packets as Windows is selecting the queues and I’m indicating packets as I receive them.

In the transmit path, 16 cores are affinitized to the 16 queues via MSI-X entry, and they handle calling NdisMSendNetBufferListsComplete. I’ve spread those 16 queues across the other 112 cores evenly for MiniportSendNetBufferLists.

The logic is simple: when I receive a NET_BUFFER_LIST from MiniportSendNetBufferLists, I find the mapped queue and append that NET_BUFFER_LIST to its work list. The thread then pops off whatever is at the head of the work list and writes it to the NIC. These 16 queues aren’t ordered with each other and can all write to the NIC simultaneously, but from the docs this seems fine since each CPU only has a single queue it will write to. Going through Jeffrey’s instructions:

If a single call to MiniportSendNetBufferLists has two packets A1 and A2; you will transmit A1 before transmitting A2.

This is true, since a single call to MiniportSendNetBufferLists always lands on the same queue, and writing to the NIC is serialized per queue via spinlock.

If we call you at DISPATCH_LEVEL on CPU X to send packet B1, then later call you again at DISPATCH_LEVEL on CPU X to send packet C1; you will transmit B1 before transmitting C1.

This is true, since we finish with an entire list of NET_BUFFER_LISTs before moving on to the next, and CPUs always write to the same queue.

If we tell you to send packet D1, and you call NdisMSendNetBufferListsComplete(D1), then later we tell you to send packet E1; you will transmit D1 before transmitting E1.

This is true, assuming it’s referring to packets sent on CPUs mapped to a single queue.
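
To make the design described above concrete, here is a minimal user-space sketch of the per-queue work list. It is not driver code: `struct pkt` is a hypothetical stand-in for a NET_BUFFER_LIST, the C11 `atomic_flag` only models the per-queue spinlock, and `cpu_to_queue` assumes a simple modulo mapping because the real mapping isn't given. It only shows that a chain handed over in one call stays in order within its queue; it says nothing about ordering between queues.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NUM_TX_QUEUES 16u  /* queue count taken from the post */

/* Hypothetical stand-in for a NET_BUFFER_LIST: an id plus a chain link. */
struct pkt {
    int id;
    struct pkt *next;
};

struct tx_queue {
    atomic_flag lock;   /* models the per-queue spinlock */
    struct pkt *head;
    struct pkt *tail;
};

void tx_queue_init(struct tx_queue *q)
{
    atomic_flag_clear(&q->lock);
    q->head = NULL;
    q->tail = NULL;
}

/* Assumed mapping; the post only says the real one is semi-arbitrary. */
unsigned cpu_to_queue(unsigned cpu)
{
    return cpu % NUM_TX_QUEUES;
}

static void tx_lock(struct tx_queue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire)) {
        /* spin until the holder releases the lock */
    }
}

static void tx_unlock(struct tx_queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Append a whole chain under one lock acquisition, so the packets of a
 * single send call stay adjacent and in their original order. */
void tx_enqueue_chain(struct tx_queue *q, struct pkt *chain)
{
    struct pkt *last = chain;

    if (chain == NULL)
        return;
    while (last->next != NULL)
        last = last->next;

    tx_lock(q);
    if (q->tail != NULL)
        q->tail->next = chain;
    else
        q->head = chain;
    q->tail = last;
    tx_unlock(q);
}

/* Pop the packet at the head of the work list, or NULL if it is empty. */
struct pkt *tx_dequeue(struct tx_queue *q)
{
    struct pkt *p;

    tx_lock(q);
    p = q->head;
    if (p != NULL) {
        q->head = p->next;
        if (q->head == NULL)
            q->tail = NULL;
        p->next = NULL;
    }
    tx_unlock(q);
    return p;
}
```

With this model, enqueueing a chain A1 -> A2 and then a separate packet B1 on the same queue always dequeues A1, A2, B1. Two packets that land on different queues have no defined relative order, which is exactly the case being asked about.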

I’m somewhat at a loss here – can anyone confirm that it’s fine if the ordering between NET_BUFFER_LISTs on different CPUs isn’t maintained? Or are TCP streams (where ordering matters) always going to hit the same CPU?

Cheers,
David

Well, that’s a big machine. I haven’t looked up the latest specifications, but I would expect that to be a 4-socket server or larger, and all of those systems are NUMA based. So the first question is how you are ‘evenly’ distributing the send queues over those cores? Usually the best performance does not come from an even distribution, but from a distribution that considers the NUMA node to which the hardware is actually attached. But like all things performance, your mileage will vary, so do your own testing.

The next question is, are you actually seeing a problem? It is clearly impossible for ordering between queues dispatched by independent CPUs to be anything other than arbitrary. And streams of packets where inter-packet ordering is important are usually assigned to the same queue (i.e. a single TCP stream) while uncorrelated streams of packets are assigned to other queues (i.e. multiple TCP streams). But even if that is not so, switching and routing issues can readily cause out-of-order packet arrivals. And so the modern versions of the TCP stack in Windows (and other platforms) are able to avoid excessive retransmissions, which kill throughput. This is an essential feature when considering 10 Gb/s+ link speeds.
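
As an illustration of the "same stream, same queue" idea, here is a minimal sketch that hashes a TCP 4-tuple to a queue index. The `flow_key` struct and `flow_to_queue` are hypothetical names, and the hash is plain FNV-1a chosen for brevity, not the Toeplitz hash that RSS uses. The only point is that every packet of one connection maps to one queue, so per-queue FIFO ordering is enough for that stream.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow identifier: the usual TCP/IPv4 4-tuple. */
struct flow_key {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
};

/* 32-bit FNV-1a over a byte buffer (a stand-in for a real RSS hash). */
uint32_t fnv1a32(const uint8_t *data, size_t len)
{
    uint32_t h = 2166136261u;
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

/* Map a flow to a queue. Fields are serialized by hand so struct padding
 * never leaks into the hash input. num_queues must be nonzero. */
unsigned flow_to_queue(const struct flow_key *k, unsigned num_queues)
{
    uint8_t buf[12];

    buf[0]  = (uint8_t)(k->src_ip >> 24);
    buf[1]  = (uint8_t)(k->src_ip >> 16);
    buf[2]  = (uint8_t)(k->src_ip >> 8);
    buf[3]  = (uint8_t)(k->src_ip);
    buf[4]  = (uint8_t)(k->dst_ip >> 24);
    buf[5]  = (uint8_t)(k->dst_ip >> 16);
    buf[6]  = (uint8_t)(k->dst_ip >> 8);
    buf[7]  = (uint8_t)(k->dst_ip);
    buf[8]  = (uint8_t)(k->src_port >> 8);
    buf[9]  = (uint8_t)(k->src_port);
    buf[10] = (uint8_t)(k->dst_port >> 8);
    buf[11] = (uint8_t)(k->dst_port);

    return (unsigned)(fnv1a32(buf, sizeof buf) % num_queues);
}
```

Calling `flow_to_queue` twice with the same key always returns the same index, while different connections spread over the available queues.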

The rule of course, is don’t intentionally cause out of order packets, but since they are going to happen anyways, a few won’t be so bad.

AFAIK modern switches are MUCH better at avoiding lost and out-of-order packets, and technologies like LACP and spanning tree help tremendously.

so the first question is how you are ‘evenly’ distributing the send queues over those cores?

I do keep queues localized per NUMA node when distributing them. At the moment my design is having a CPU->queue map for every CPU (mapping is semi arbitrary), and letting every CPU write to its mapped queue, but I’m planning on seeing if it’s better to reschedule NET_BUFFER_LISTs on the CPU the queues are actually affinitized to for cache locality.
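
For illustration only, a NUMA-local CPU->queue map could look like the sketch below. The topology is an assumption made up for the example (4 nodes of 32 CPUs, 4 transmit queues per node); the thread does not state the actual node layout, and `numa_local_queue` and `queue_node` are hypothetical helpers.

```c
/* Assumed topology, NOT taken from the thread: 4 NUMA nodes x 32 CPUs,
 * with the 16 transmit queues split 4 per node. */
#define CPUS_PER_NODE   32u
#define QUEUES_PER_NODE 4u

/* Pick a queue that lives on the same NUMA node as the sending CPU. */
unsigned numa_local_queue(unsigned cpu)
{
    unsigned node = cpu / CPUS_PER_NODE;
    unsigned slot = cpu % CPUS_PER_NODE;

    return node * QUEUES_PER_NODE + slot % QUEUES_PER_NODE;
}

/* Node that owns a given queue under the same assumed layout. */
unsigned queue_node(unsigned queue)
{
    return queue / QUEUES_PER_NODE;
}
```

Under these assumptions every CPU only ever writes to a queue on its own node, and each CPU still has exactly one queue, which keeps the per-CPU ordering property intact.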

The next question is, are you actually seeing a problem?

That’s a good question, actually. I’ve seen quite a few retransmits (~7% TCP retransmits on 50 Gbps throughput) when testing with ntttcp.exe, but I haven’t actually verified that it’s an out of order problem via tcpdump due to difficulties with the test environment. NDISTest does flag a large number of out of order packets, but I wouldn’t be surprised if NDISTest is just sending NET_BUFFER_LISTs round-robin across CPUs and assuming we should synchronize.

It sounds like I should bite the bullet and get a tcpdump to verify that it’s actually an out of order issue causing retransmits.
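
A rough sketch of the kind of check a capture would allow, assuming the sequence number and payload length of each segment of one TCP stream have already been extracted in arrival order (the `seg` struct and `classify_segments` are hypothetical, not part of tcpdump or any tool). A segment counts as "late" if it starts before data that has already been seen; that covers both reordered segments and retransmissions, so telling the two apart would need a further check.

```c
#include <stddef.h>
#include <stdint.h>

/* One captured TCP segment of a single stream (hypothetical layout). */
struct seg {
    uint32_t seq;  /* sequence number of the first payload byte */
    uint32_t len;  /* payload length in bytes */
};

struct reorder_stats {
    size_t in_order;  /* segments starting at or after the highest byte seen */
    size_t late;      /* segments starting before it */
};

/* Wraparound-safe comparison: nonzero if a precedes b modulo 2^32. */
int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Walk the segments in arrival order and count late arrivals. */
struct reorder_stats classify_segments(const struct seg *segs, size_t n)
{
    struct reorder_stats st = {0, 0};
    uint32_t highest_end;
    size_t i;

    if (n == 0)
        return st;

    highest_end = segs[0].seq + segs[0].len;
    st.in_order = 1;

    for (i = 1; i < n; i++) {
        uint32_t end = segs[i].seq + segs[i].len;

        if (seq_before(segs[i].seq, highest_end))
            st.late++;
        else
            st.in_order++;

        if (seq_before(highest_end, end))
            highest_end = end;
    }
    return st;
}
```

For example, segments arriving with sequence numbers 1000, 1100, 1300, 1200 (100 bytes each) would be classified as three in order and one late.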

What parameters are you using with ntttcp? IIRC it can behave quite differently depending on how you invoke it.

I’m not sure I understand your queueing scheme. Are you saying that any CPU in any NUMA node can queue packets to a queue related to any NUMA node, including nodes other than its own? If so, this will clearly be a performance issue, notwithstanding the TCP packet order issue, since accessing non-local memory will be more costly. There is of course no guarantee that the memory that CPU x is working with is actually local to CPU x, but it is more likely.

I have not worked with 50 Gb/s Ethernet specifically, but on a LAN 7% retransmits seems high. I have seen traces where 20+% of packets arrive out of order and virtually no resend processing is done, so either the timing is bad or packets are actually lost instead of just badly ordered.

But the question I guess I should have asked already is whether you are trying to distribute the packets for a single TCP connection or to scale with many TCP connections?