Multiple MiniportSendNetBufferLists

Hi,
According to MSDN:

The order of the linked list of NET_BUFFER_LIST structures that NDIS passes at NetBufferLists represents the order in which the miniport driver should transmit the network data. In addition, a miniport driver should send the NET_BUFFER_LIST structures from multiple MiniportSendNetBufferLists calls in the order in which the miniport driver received the structures.
http://msdn.microsoft.com/en-us/library/windows/hardware/ff559440(v=vs.85).aspx

The last sentence confuses me,
“a miniport driver should send the NET_BUFFER_LIST structures from multiple MiniportSendNetBufferLists calls in the order in which the miniport driver received the structures.”

Does “multiple MiniportSendNetBufferLists calls” refer to :

  1. multiple MiniportSendNetBufferLists that are called by the OS one after the other on the same CPU core,
  2. multiple MiniportSendNetBufferLists that are called by the OS simultaneously on several CPU cores.

If it refers to 2 or both, how can the driver keep the send order when 2 cores asks for it simultaneously?

Thanks

On 30-Sep-2012 12:49, xxxxx@gmail.com wrote:

If it refers to 2 or both, how can the driver keep the send order when 2 cores asks for it simultaneously?

An answer (a simple one) is a few lines down in the text…

“The MiniportSendNetBufferLists function must synchronize access to its
internal queues of network data … A miniport driver can use spin
locks to synchronize access to the queues.”

A more complicated answer is that some drivers actively break rule 2
by reordering data coming from different MiniportSendNetBufferLists
calls, based on priority, destination or whatever.
People who do this are supposed to know what they are doing and why.

Regards,
– pa

I don’t understand how synchronization can help us here.
For example:

  1. Driver gets a MiniportSendNetBufferLists call for the packet with sequence 0x1 on CPU 1
  2. CPU 1 takes an interrupt before the driver acquires its lock
  3. Driver gets a MiniportSendNetBufferLists call for the packet with sequence 0x2 on CPU 2
  4. On CPU 2, the driver acquires the lock, adds packet 0x2 to the internal TX queue and releases the lock
  5. Driver sends packet 0x2 on CPU 2
  6. CPU 1 finishes handling the interrupt and returns to the MiniportSendNetBufferLists call for packet 0x1:
    it acquires the lock, adds packet 0x1 to the internal TX queue, releases the lock and sends packet 0x1

In this example, the driver changed the sending order of the received NBLs.
I don’t understand how this can be prevented, even with spin locks.

Right, you’ve identified a fundamental race that is impossible to solve. Fortunately, you’re not expected to solve it.

What we really need from you is a best-effort for three things:

  1. If a single call to MiniportSendNetBufferLists has two packets A1 and A2; you will transmit A1 before transmitting A2.
  2. If we call you at DISPATCH_LEVEL on CPU X to send packet B1, then later call you again at DISPATCH_LEVEL on CPU X to send packet C1; you will transmit B1 before transmitting C1.
  3. If we tell you to send packet D1, and you call NdisMSendNetBufferListsComplete(D1), then later we tell you to send packet E1; you will transmit D1 before transmitting E1.

You may break the above rules if instructed to by QoS protocols like 802.1p, WMM, or DCB. You are allowed to occasionally bend one of the rules in limited corner cases, if you must (e.g., perhaps immediately after resuming from low-power, you flush a queue of packets that could temporarily race with new incoming packets). The purpose of these rules is to try to keep the recipient on the fast-path of in-order packet delivery. The recipient must correctly handle packet re-ordering, but we don’t want to force it to take that slower path in the common case. So if you make a rare reordering once every few hours, that’s not a big deal. But the more you reorder packets, the slower the user’s network applications will get.

All three requirements are achievable if your send handler looks something like this:

MiniportSendNetBufferLists(NblsToSend)
{
    AcquireLock();

    // Move each NBL, in order, to the tail of the adapter TX queue
    while (NblsToSend)
    {
        Nbl = PopListHead(NblsToSend);
        PushListTail(Adapter->TxQueue, Nbl);
    }

    ReleaseLock();
}
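
To make the pseudocode above concrete, here is a minimal user-mode sketch of the same idea. It is only a model, not driver code: `Nbl`, `TxQueue`, `send_nbls` and `tx_dequeue` are hypothetical names, the `Nbl` struct stands in for NET_BUFFER_LIST, and a pthread mutex stands in for the spin lock a real miniport would use.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for NET_BUFFER_LIST: only what this model needs. */
typedef struct Nbl {
    int id;
    struct Nbl *next;
} Nbl;

/* Hypothetical per-adapter transmit queue: a FIFO with a tail pointer. */
typedef struct TxQueue {
    Nbl *head;
    Nbl *tail;
    pthread_mutex_t lock;   /* stands in for the driver's spin lock */
} TxQueue;

static void tx_queue_init(TxQueue *q)
{
    q->head = NULL;
    q->tail = NULL;
    pthread_mutex_init(&q->lock, NULL);
}

/* Model of the send handler: append the whole chain, in order, under one lock. */
static void send_nbls(TxQueue *q, Nbl *chain)
{
    pthread_mutex_lock(&q->lock);
    while (chain != NULL) {
        Nbl *nbl = chain;          /* pop the head of the incoming chain */
        chain = chain->next;
        nbl->next = NULL;
        if (q->tail != NULL)       /* push it on the tail of the TX queue */
            q->tail->next = nbl;
        else
            q->head = nbl;
        q->tail = nbl;
    }
    pthread_mutex_unlock(&q->lock);
}

/* Model of the transmit side: take the oldest queued NBL, or NULL if empty. */
static Nbl *tx_dequeue(TxQueue *q)
{
    pthread_mutex_lock(&q->lock);
    Nbl *nbl = q->head;
    if (nbl != NULL) {
        q->head = nbl->next;
        if (q->head == NULL)
            q->tail = NULL;
        nbl->next = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    return nbl;
}
```

Because the lock is held for the whole chain, packets from one call stay together and in order, and two calls made one after the other are queued in call order. It does not (and cannot) decide which of two truly simultaneous calls goes first.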

Of course, there are ways to optimize this. But the point is, you keep the NBLs roughly ordered within a call, and between calls (on the same CPU). What we *really* don’t want you to do is to do something like this:

MiniportSendNetBufferListsBad(NblsToSend)
{
    // bad - reorders the NBL chain
    InterlockedPushListSList(Adapter->TxQueue, NblsToSend, …);
}

(In general it’s dangerous to use SLISTs for the TX path. It can be made to work, but you have to think carefully about reordering.)
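
A small model may help show where the reordering comes from. The sketch below is an assumption-laden illustration, not the real SLIST API: `LifoList`, `lifo_push_chain` and `lifo_pop` are hypothetical names, and the code is single-threaded with no interlocked operations. It only models the push-at-front, pop-from-front behaviour of a LIFO list.

```c
#include <stddef.h>

/* Hypothetical stand-in for NET_BUFFER_LIST. */
typedef struct Nbl {
    int id;
    struct Nbl *next;
} Nbl;

/* Hypothetical LIFO list head: models push-at-front only, not the real SLIST. */
typedef struct LifoList {
    Nbl *head;
} LifoList;

/* Push an already-linked chain [first..last] in front of whatever is queued. */
static void lifo_push_chain(LifoList *l, Nbl *first, Nbl *last)
{
    last->next = l->head;
    l->head = first;
}

/* Pop the entry at the front, i.e. the most recently pushed one. */
static Nbl *lifo_pop(LifoList *l)
{
    Nbl *n = l->head;
    if (n != NULL) {
        l->head = n->next;
        n->next = NULL;
    }
    return n;
}
```

If one call pushes B1 and a later call pushes C1, the first pop returns C1, so the later call is transmitted first. That is the opposite of what rule 2 above asks for, which is why a LIFO structure on the TX path needs extra care.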

The NDISTest framework is pretty good about letting you know if you’ve reordered NBLs. I think it tolerates some amount of reordering at high throughput, but is stricter on low-throughput tests.

In contrast to the Send and Receive paths, the SendComplete and ReceiveReturn paths do not care about order of NBLs. So feel free to gratuitously alphabetize the NBLs however you want in SendComplete and ReceiveReturn. (NDIS itself will haphazardly reorder packets in these handlers in some cases.)

Things get more interesting when the NIC has 16+ tx chains with TSS enabled
and the computer has 128 cores :-) I’m not making it up though.

Thanks Jeffrey for clarifying that.
Really appreciate it!

Another related question:

  1. Can the driver influence the OS in such a way that the OS will call MiniportSendNetBufferLists using a single core?
  2. Can an APP configure UDP socket in such a way that the OS will call MiniportSendNetBufferLists using a single core? (for example we see that buffer size socket configuration influence OS core usage when calling MiniportSendNetBufferLists)

Thanks

> 1. Can the driver influence the OS in such a way that the OS will call MiniportSendNetBufferLists using a single core?

Not anymore. In the Before Time, all NDIS miniports were serialized like this. But this did not scale well, so the NDIS design was changed and send handlers were deserialized. Part of the problem is that not *all* of your send handler needs to be serialized. For example, if you parse and validate NBLs, that can happen in parallel on multiple cores; serializing *that* will just result in needless performance degradation. (And as Calvin proudly observed, some hardware is designed to be accessed from multiple CPUs in parallel, so the whole send handler could theoretically operate without a spinlock.)

> 2. Can an APP configure UDP socket in such a way that the OS will call MiniportSendNetBufferLists using a single core?

Generally the OS will try to keep I/O on the same CPU core, all the way from the application to the miniport driver. Therefore, if your app uses an API like SetThreadGroupAffinity to fix itself to a single core, then most (not all) transmit packets should hit your miniport on that same core.

HOWEVER, explicit affinitization is usually a sign of somebody trying to hide race bugs. There are actually surprisingly few legitimate uses of affinitization. It is never appropriate to use affinitization to achieve multi-processing correctness; you must use spinlocks and other synchronization mechanisms for that. Affinitization is also rarely appropriate to achieve better performance, since you’re essentially saying that you know better than the OS’s scheduler. (And the scheduler is very smart.) Usually it’s only possible to beat the scheduler if you have a highly-constrained scenario (e.g., a fixed hardware platform), and you’re willing to do careful performance benchmarking to make sure that you are indeed eking out an advantage over the default case.

If you are new-ish to parallel programming, it’s totally appropriate to just slap one giant spinlock around your MiniportSendNetBufferLists handler and ship it. As you get more comfortable with parallel programming, you can break that giant lock apart into more granular (and thus hopefully, more performant) locks.
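
As one possible next step after the single giant lock, the sketch below keeps per-packet validation outside the lock and holds the lock only to splice the prepared chain onto the queue. It is a user-mode model under stated assumptions: `Nbl`, `TxQueue`, `nbl_is_valid` and `send_nbls` are hypothetical names, the validity check is a placeholder, and a pthread mutex stands in for a spin lock.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for NET_BUFFER_LIST. */
typedef struct Nbl {
    int id;
    int length;
    struct Nbl *next;
} Nbl;

typedef struct TxQueue {
    Nbl *head;
    Nbl *tail;
    pthread_mutex_t lock;   /* stands in for the driver's spin lock */
} TxQueue;

/* Placeholder check; it touches no shared state, so it can run unlocked. */
static bool nbl_is_valid(const Nbl *nbl)
{
    return nbl->length > 0;
}

/* Returns how many NBLs were queued; *rejected counts the ones dropped.
 * (A real driver would complete rejected NBLs with a failure status.) */
static int send_nbls(TxQueue *q, Nbl *chain, int *rejected)
{
    Nbl *first = NULL, *last = NULL;
    int accepted = 0;
    *rejected = 0;

    /* Phase 1, no lock: validate and rebuild the chain in its original order. */
    while (chain != NULL) {
        Nbl *nbl = chain;
        chain = chain->next;
        nbl->next = NULL;
        if (!nbl_is_valid(nbl)) {
            (*rejected)++;
            continue;
        }
        if (last != NULL)
            last->next = nbl;
        else
            first = nbl;
        last = nbl;
        accepted++;
    }
    if (first == NULL)
        return 0;

    /* Phase 2, lock held: one constant-time splice onto the queue tail. */
    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL)
        q->tail->next = first;
    else
        q->head = first;
    q->tail = last;
    pthread_mutex_unlock(&q->lock);
    return accepted;
}
```

The chain still lands on the queue as one contiguous, in-order block, so ordering within a call is preserved while the time spent under the lock no longer grows with the length of the chain.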

> 1. “Not anymore. In the Before Time, all NDIS miniports were serialized like this.
But this did not scale well, so the NDIS design was changed and send handlers
were deserialized.”

I see a different behaviour between win7 and win8 regarding multicore usage on MiniportSendNetBufferLists.
When was the deserialization introduced (xp,vista,win7,win8?)

Thanks

On 04-Oct-2012 01:34, xxxxx@gmail.com wrote:

I see a different behaviour between win7 and win8 regarding multicore usage on MiniportSendNetBufferLists.
When was the deserialization introduced (xp,vista,win7,win8?)

Win2000 or even NT.

– pa

IIRC NT4 SP3.

Michal


