Prior to Vista, the receive could happen on any CPU, depending on where the interrupt controller felt like delivering the interrupt. Typically this would be CPU0. The OS did not attempt to shift load around to other CPUs once the interrupt was received, since that would not really improve perf. Part of the purpose of MSI-X is that the hardware now has some say in which CPU gets interrupted. (Or at least, which vector is signaled, which the OS can carefully map to select CPUs).
We do support RSS without the ability for hardware to target the CPU (that’s why we have NdisMQueueDpc[Ex], so your ISR can do the routing). But it’s not as efficient as MSI-X, since the ISR still happens on a single CPU. And this isn’t implemented prior to the SNP.
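To make the ISR-routing case concrete, here is a rough user-mode sketch in plain C (not driver code). It assumes the hardware gives you a 32-bit RSS hash per received packet and that you keep a power-of-two indirection table of CPU numbers. The names rss_table_t, rss_table_init, rss_cpu_for_hash, and rss_dpc_mask are made up for illustration. The resulting bitmask is roughly the kind of value a miniport would pass as the target-processor argument to NdisMQueueDpc; that call itself is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical indirection table: each slot names the CPU that should
 * process packets whose hash lands in that slot. */
typedef struct {
    uint8_t cpu[128];
    size_t  size;   /* slots in use; must be a power of two, <= 128 */
} rss_table_t;

/* Fill the table round-robin across cpu_count processors.
 * Assumes at most 32 CPUs so that a 32-bit mask can describe them. */
static void rss_table_init(rss_table_t *t, size_t size, uint8_t cpu_count)
{
    assert(size != 0 && size <= 128 && (size & (size - 1)) == 0);
    assert(cpu_count != 0 && cpu_count <= 32);
    t->size = size;
    for (size_t i = 0; i < size; i++)
        t->cpu[i] = (uint8_t)(i % cpu_count);
}

/* The low bits of the hash select a slot; the slot names the CPU.
 * The same hash always maps to the same CPU, which keeps a flow in order. */
static uint8_t rss_cpu_for_hash(const rss_table_t *t, uint32_t hash)
{
    return t->cpu[hash & (uint32_t)(t->size - 1)];
}

/* What the ISR would do for one batch of received packets: collect the
 * set of CPUs that have work pending, as a bitmask (bit n = CPU n). */
static uint32_t rss_dpc_mask(const rss_table_t *t,
                             const uint32_t *hashes, size_t n)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < n; i++)
        mask |= 1u << rss_cpu_for_hash(t, hashes[i]);
    return mask;
}
```

Note that the ISR itself still runs on whichever single CPU took the interrupt; only the DPCs fan out, which is the inefficiency compared with MSI-X.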
>I think the problem is that out-of-order packet processing … introduces hefty performance penalties in NDIS
Yes, you really have to avoid splitting TCP or UDP streams, else TCPIP has to take a slow reassembly path. Thus maintaining the order of TCP/UDP packets within a stream is critical for any NIC that cares about perf. That’s why RSS goes to such great lengths to hash the tuples. NDISTest will also print warnings in the debugger if it detects that you delivered any packets out-of-order. If you see even a few warnings about this, you have a code bug.
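For reference, the tuple hashing looks roughly like this. It is a minimal user-mode sketch of a Toeplitz hash over the IPv4 4-tuple, written in plain C. The flow4_t struct, the function names, and the sample key are assumptions for illustration only (any 40 bytes would do as a key), and the byte layout of the hash input (source address, destination address, source port, destination port, all in network order) is my assumption about the usual TCP/IPv4 case.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sample 40-byte secret key, for illustration only. */
static const uint8_t sample_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa
};

/* Toeplitz hash: for every set bit of the input (MSB first), XOR in the
 * 32-bit window of the key that starts at that bit position.
 * Requires key_len >= 4; key bytes past the end are treated as zero. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t len)
{
    uint32_t result = 0;
    /* window = the 32 key bits aligned with the current input bit */
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];

    for (size_t i = 0; i < len; i++) {
        /* next key byte to be shifted into the window, bit by bit */
        uint8_t next = (i + 4 < key_len) ? key[i + 4] : 0;
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                result ^= window;
            window = (window << 1) | ((uint32_t)(next >> bit) & 1u);
        }
    }
    return result;
}

/* Hypothetical flow descriptor; ports are in host byte order. */
typedef struct {
    uint8_t  src_ip[4];
    uint8_t  dst_ip[4];
    uint16_t src_port;
    uint16_t dst_port;
} flow4_t;

/* Serialize the 4-tuple in network byte order and hash it. Every packet
 * of a given flow produces the same bytes, hence the same hash. */
static uint32_t flow_hash(const uint8_t *key, size_t key_len,
                          const flow4_t *f)
{
    uint8_t in[12];
    memcpy(in, f->src_ip, 4);
    memcpy(in + 4, f->dst_ip, 4);
    in[8]  = (uint8_t)(f->src_port >> 8);
    in[9]  = (uint8_t)(f->src_port & 0xff);
    in[10] = (uint8_t)(f->dst_port >> 8);
    in[11] = (uint8_t)(f->dst_port & 0xff);
    return toeplitz_hash(key, key_len, in, sizeof in);
}
```

Because the hash depends only on the tuple, every packet of one TCP or UDP stream gets the same value and so lands on the same CPU, which is what keeps the stream in order.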
To go back to the original question… I believe WDF works with RSS, but I haven’t personally tried it. You’d need to run WDF in miniport-mode, and *not* run NDIS in WDM mode. Basically, *somebody* has to call IoConnectInterrupt for you, and it can either be NDIS or WDF. If you let NDIS do it, then you get the OS’s RSS support. This means you need to use NDIS APIs to implement your ISR and DPCs. At that point, you won’t be using WDF for much – maybe just a random device object or something.
>If your device is 1G then on modern computers there is no technical reason to use RSS. One can reach 10G with one core without any special problems.
True, a modern system can run ntttcpr.exe on 1G without even denting the CPU. But if you’re pushing a lot of packets *and* somebody above the NIC needs lots of CPU to process all that traffic, then sometimes it’s beneficial to indicate on multiple CPUs so that the application listening on sockets runs on multiple cores. But any gains are highly dependent on what is going on in usermode. I’d agree that consumer-grade NICs don’t need RSS (and all they’ll do with your NIC is download a few kbps from youtube.com anyway).
-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of James Harper
Sent: Wednesday, March 30, 2011 4:09 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] RSS and MSiX Support in KMDF
>Under 2003 (without SNP) and below, NDIS processing always seemed to
>happen on CPU0.
Can some person at MS confirm this? For me, this statement looks extremely strange.
I’m not Microsoft obviously, but the Microsoft RSS documentation for SNP (http://download.microsoft.com/download/5/d/6/5d6eaf2b-7ddf-476b-93dc-7cf0072878e6/ndis_rss.doc) states that “the ability of the networking protocol stack of the Microsoft® Windows® operating system to scale well on a multi-CPU system is inhibited because the architecture of Network Driver Interface Specification (NDIS) 5.1 and earlier versions limits receive protocol processing to a single CPU”. I think you may be able to change it, but my testing shows it always ends up on CPU0.
>could happen on another CPU. Can receive processing be targeted to
>another CPU (even if its always the same CPU) under Vista and above
- NIC interrupt (actually any interrupt) is directed by IO APIC, which can direct it to any CPU.
- DpcForIsr, including ndisMDpc, runs on the same CPU as the NIC interrupt.
- Receive path is called from ndisMDpc.
Am I wrong?
I actually tried that in my driver (without thinking about the consequences and before I knew anything about RSS) and performance was awful. I think the problem is that out-of-order packet processing, a consequence of processing happening on different CPUs concurrently, introduces hefty performance penalties in NDIS. RSS gets around this by calculating a hash based on the connection-specific items in the packet header and dishing out the work to a fixed processor per hash value, so per-connection packets are guaranteed to be processed in the same order that they arrived on the hardware. Without MSI, the interrupt comes to an arbitrary CPU but the DPC is scheduled to the correct CPU. With MSI, I believe that you can have one MSI per CPU, and with correct hardware support the interrupt can be generated on the correct CPU from the start.
James
NTDEV is sponsored by OSR
For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars
To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer