Understanding which core does 'work' and why

I’m trying to solve an optimization problem and am hitting my head on my shallow understanding of Windows internals.

We have an NDIS filter driver that MOSTLY just passes all requests through it with the exception of certain packets of type GVSP (GigE Vision Stream Protocol). Those packets it queues up and then assembles into a completed image frame for the user.

So the user application is doing normal network read/write but it ALSO does an IOCTL to the driver passing in a buffer big enough for a completed frame. That IOCTL blocks until the frame is complete at which time control returns to the caller.

The point of this is to avoid the extra copies that would be required by the user to assemble the packets into a completed frame. Doing it directly in the driver makes for one less copy.
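For reference, this is roughly what the user side of that looks like; the device name, IOCTL code, and frame size below are placeholders rather than our real values:

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    /* Placeholder control code, device name, and frame size -- illustration only. */
    #define IOCTL_GVSP_WAIT_FOR_FRAME \
        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_OUT_DIRECT, FILE_ANY_ACCESS)
    #define GVSP_DEVICE_NAME  L"\\\\.\\GvspFilter"
    #define FRAME_SIZE        (4u * 1024u * 1024u)

    int main(void)
    {
        HANDLE hDevice = CreateFileW(GVSP_DEVICE_NAME, GENERIC_READ | GENERIC_WRITE,
                                     0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hDevice == INVALID_HANDLE_VALUE)
            return 1;

        BYTE *frame = (BYTE *)VirtualAlloc(NULL, FRAME_SIZE, MEM_COMMIT, PAGE_READWRITE);
        DWORD bytesReturned = 0;

        /* Blocks (or stays pending in the driver) until a complete frame has
           been assembled directly into 'frame' -- no extra user-mode copy. */
        if (frame != NULL &&
            DeviceIoControl(hDevice, IOCTL_GVSP_WAIT_FOR_FRAME,
                            NULL, 0, frame, FRAME_SIZE, &bytesReturned, NULL))
        {
            printf("Received a %lu-byte frame\n", bytesReturned);
        }

        if (frame != NULL)
            VirtualFree(frame, 0, MEM_RELEASE);
        CloseHandle(hDevice);
        return 0;
    }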

Here’s where my problem arises. We have a situation where we have an application capturing and processing frames from a 10G NIC. It can easily be seen that one or more of the CPU cores are much more ‘busy’ than the others and every now and then we get a dropped frame. If we adjust the affinity mask of the application to avoid the ‘busy’ cores, we no longer get any dropped frames.

My assumption is that leaving it up to the Windows scheduler causes the CPU usage on that core to occasionally spike just a bit too high resulting in a dropped frame. So this says to me that setting the affinity of the application is a reasonable thing to do.
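The workaround itself is nothing more than trimming the process affinity mask; the assumption baked into the sketch below (that we know which core index is the busy one) is exactly the part we can't determine deterministically yet:

    #include <windows.h>

    /* Remove one (assumed) busy core from the process affinity mask.
       'busyCoreIndex' is a guess here; figuring it out deterministically
       is the open question. */
    BOOL AvoidBusyCore(DWORD busyCoreIndex)
    {
        DWORD_PTR processMask = 0, systemMask = 0;

        if (!GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask))
            return FALSE;

        DWORD_PTR newMask = processMask & ~((DWORD_PTR)1 << busyCoreIndex);
        if (newMask == 0)          /* never mask away every core */
            return FALSE;

        return SetProcessAffinityMask(GetCurrentProcess(), newMask);
    }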

I’m trying to understand why there is this concentration of activity on some particular cores. I’m also trying to discover a deterministic way to know WHICH core is the ‘busy’ one so that I can set the affinity mask of the application properly. My initial guess is that it’s the IRQ affinity of the NIC since our ‘driver’ is just a filter and doesn’t have any threads. I was playing with the Interrupt Affinity Policy Tool to set affinity for the NICs but it doesn’t seem to have any impact.

My understanding is that a user call to a kernel driver is simply the same thread (and likely CPU core) continuing from user to kernel space. I also understand that IRQs are serviced by co-opting whichever thread happens to be running on the core at the time. I don’t entirely understand where packets travelling up the network stack fit into this.

Sorry for the long-winded post. I'm hoping someone can give me an idea if I'm on the right track with IRQ affinity or if there are other suggestions to try. I don't see any way within our driver to discover which core to avoid, or to control which core the NIC interrupts are delivered to.

> That IOCTL blocks until the frame is complete at which time control returns to the caller.

Tangential point: I hope that's not *literally* true. I hope you mean "the IOCTL is held in PENDING state until it is returned to the caller" – but what I know about NDIS filters can fit in a thimble and still have room left over.

Windows does some pretty sophisticated scheduling… but it's impossible for it to be optimal for all uses. Windows will strongly prefer the core on which a thread last ran – this is called the thread's Ideal Processor. If the Ideal Processor is not available, Windows will try another core on the same NUMA node or within the same SMT set. The Ideal Processor for a given thread is chosen round-robin among the cores in a given NUMA/SMT group.
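If you're curious where the scheduler currently thinks a given thread "belongs", you can simply ask it from user mode; a quick sketch (plain Win32, nothing driver-specific about it):

    #include <windows.h>
    #include <stdio.h>

    /* Print the calling thread's Ideal Processor, and optionally nudge it. */
    void ShowAndSetIdealProcessor(void)
    {
        PROCESSOR_NUMBER ideal;

        if (GetThreadIdealProcessorEx(GetCurrentThread(), &ideal))
            printf("Ideal processor: group %u, number %u\n",
                   (unsigned)ideal.Group, (unsigned)ideal.Number);

        /* Example only: ask the scheduler to prefer group 0, processor 1. */
        PROCESSOR_NUMBER want = { 0 }, previous;
        want.Group  = 0;
        want.Number = 1;
        SetThreadIdealProcessorEx(GetCurrentThread(), &want, &previous);
    }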

> My understanding is that a user call to a kernel driver is simply the same thread (and likely CPU core) continuing from user to kernel space.

True, but only guaranteed for the first driver entered by the I/O Subsystem. That driver can do various things (pend the request and come back to it later, for example) that could change the core.

> I also understand that IRQs are serviced by co-opting whichever thread happens to be running on the core at the time.

Also true. Bravo for knowing this. Truly. So few people "get" the whole idea of "thread stealing," which is what we call it at OSR.

Well, to be honest, I'm not really sure either. We DO have a very good lead from the NDIS team who frequently reviews posts on this list and mayhaps he'll chime in.

I can tell you that, in general, the interrupt architecture strives to keep the work running on the same core as the one on which the interrupt arrived. In NDIS, the core on which the interrupt arrives can be determined by many different things… even by the network device itself (seriously… the device, using MSI-X, can say "I want to interrupt the system, and I want the ISR to run on THIS particular core").
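Outside of NDIS, by the way, the generic way a driver steers its deferred work to a particular core is by targeting its DPC. Just a sketch; an NDIS miniport normally lets NDIS queue its interrupt DPCs for it, as described in the next reply:

    #include <ntddk.h>

    /* Sketch: targeting a DPC at a specific core with the generic kernel API. */
    typedef struct _MY_CONTEXT {
        KDPC Dpc;
    } MY_CONTEXT, *PMY_CONTEXT;

    KDEFERRED_ROUTINE MyDpcRoutine;

    VOID MyDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Context);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);
        /* Deferred work runs here, on the core targeted below. */
    }

    VOID InitDpcOnCore(PMY_CONTEXT Ctx, CCHAR CoreNumber)
    {
        KeInitializeDpc(&Ctx->Dpc, MyDpcRoutine, Ctx);
        KeSetTargetProcessorDpc(&Ctx->Dpc, CoreNumber);   /* run on this core */
    }

    /* Later, typically from the ISR: KeInsertQueueDpc(&Ctx->Dpc, NULL, NULL); */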

I dunno… I hope some of that helps,

Peter
OSR
@OSRDrivers

You almost got it right…

Ingress packets get indicated to the bound protocols as a result of the calls to NdisMIndicateXXX that miniport drivers make in their MiniportInterruptDPC() routines. The NdisMIndicateXXX routine is therefore invoked in the context of a DPC that is queued by MiniportInterrupt(), which in turn gets invoked when the NIC requests an interrupt.

Apart from other parameters, MiniportInterrupt() receives a pointer to a bitmap that it can modify in order to specify the target processors on which NDIS should schedule a DPC. Therefore, a miniport driver is in a position to decide on which particular core ingress packets will actually get processed. For example, if the miniport always targets only core 0 in that bitmap, the packets will always get processed by core 0. I think this fully explains why you are seeing the "exceptional activity" only on certain core(s) while all the other ones don't seem to be affected at all.
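In code, the mechanism looks roughly like this. A sketch from memory only: the adapter context and the register check are placeholders, and the exact semantics of the two output parameters should be verified against the current NDIS documentation.

    #include <ndis.h>

    typedef struct _MY_ADAPTER MY_ADAPTER, *PMY_ADAPTER;   /* placeholder */
    BOOLEAN MyNicOwnsThisInterrupt(PMY_ADAPTER Adapter);    /* placeholder */

    BOOLEAN
    MiniportInterrupt(
        NDIS_HANDLE MiniportInterruptContext,
        PBOOLEAN    QueueDefaultInterruptDpc,
        PULONG      TargetProcessors
        )
    {
        PMY_ADAPTER adapter = (PMY_ADAPTER)MiniportInterruptContext;

        if (!MyNicOwnsThisInterrupt(adapter))
            return FALSE;                    /* not our interrupt */

        /* Instead of letting NDIS queue the DPC on the current CPU ... */
        *QueueDefaultInterruptDpc = FALSE;

        /* ... ask NDIS to queue MiniportInterruptDPC() on core 0 only.
           That DPC is where NdisMIndicateReceiveNetBufferLists() ends up
           being called, so every ingress indication lands on that core. */
        *TargetProcessors = 0x1;

        return TRUE;
    }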

Anton Bassov

@Peter Thanks. Re: the IOCTL, that's a good question. I should study that code, because there's a very brief timeout associated with it, so it might be that it actually IS waiting rather than returning PENDING.
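If it turns out we ARE blocking, the "pend and complete later" shape Peter is hinting at would look roughly like the sketch below (hypothetical names, plain WDM-style dispatch, and with cancellation handling omitted, which a real driver obviously needs):

    #include <ntddk.h>

    VOID GvspQueuePendingFrameIrp(PIRP Irp);   /* hypothetical: stash the IRP somewhere */

    /* Dispatch side: don't block the caller's thread -- pend the IRP instead. */
    NTSTATUS GvspWaitForFrameDispatch(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        UNREFERENCED_PARAMETER(DeviceObject);

        IoMarkIrpPending(Irp);
        GvspQueuePendingFrameIrp(Irp);
        return STATUS_PENDING;
    }

    /* Completion side: called by the frame-assembly code once the user's
       buffer has been filled. */
    VOID GvspCompleteFrameIrp(PIRP Irp, ULONG BytesInFrame)
    {
        Irp->IoStatus.Status      = STATUS_SUCCESS;
        Irp->IoStatus.Information = BytesInFrame;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
    }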

@Anton Thanks. That certainly matches our observations. The net conclusion seems to be that our poor little filter driver really doesn't have any ability to influence this. In looking at the NIC documentation there is a bunch of stuff about setting RSS Base Processor Numbers for NUMA nodes and other things I don't fully understand. It seems like it can be tuned, but not necessarily by us.

I think I have enough understanding to ‘push back’ the problem to the people building the system who were insisting that we tell them “on which core our driver was running.” :smiley:

Sorry, but setting interrupt affinity is not an NDIS-specific feature. Furthermore, IIRC, IoConnectInterrupt() allowed a caller to specify interrupt affinity long before MSI and MSI-X were even conceived. What makes NDIS special in this respect is that it allows you to specify DPC affinity on a per-case basis every time the miniport's ISR gets invoked…

Anton Bassov

Sorry, but I’m afraid the issue is more complex than you’re making it out to be.

First, NICs are one of the few types of cards that regularly make use of MSI-X… which allows the card itself to direct the interrupt to a given CPU. This isn't the DRIVER making the decision, it's the CARD.

Second, IIRC NDIS did indeed (and PERHAPS still does) have some unique mechanisms for the default routing of interrupts and/or DPCs. Because I haven't worked in the NDIS space for a *very* long time (like, we're talking NT V4 days), I'm not up on what's going on there these days. But there used to be some sort of default interrupt coalescing, and DPCs were routed to (again, IIRC) the lowest-numbered physical processor. Now, granted, this was all before NUMA.

Third, that bitmask that's always been part of IoConnectInterrupt has historically been fraught with danger and difficulties. Setting it meaningfully required that you have pre-ordained knowledge of the physical interrupt routing of your mainboard. Select the "wrong" set of processors, and you get *really* bad results. This is why the general guidance for this parameter has always been to set the KAFFINITY to 0xFFFFFFFF. Which, of course, is now obsolete in any case, in light of processor groups.
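For the record, the shape of that legacy call is below; apart from the all-ones affinity (the guidance I'm referring to), every value is a placeholder that would really come from the device's translated resources:

    #include <ntddk.h>

    /* Sketch of the legacy IoConnectInterrupt call.  The vector, IRQL, and
       interrupt mode come from the device's translated resources in real
       life; the values here are placeholders. */
    BOOLEAN MyIsr(PKINTERRUPT Interrupt, PVOID ServiceContext)
    {
        UNREFERENCED_PARAMETER(Interrupt);
        UNREFERENCED_PARAMETER(ServiceContext);
        return FALSE;   /* placeholder: claim nothing */
    }

    NTSTATUS ConnectLegacyInterrupt(PVOID DeviceExtension,
                                    ULONG TranslatedVector,
                                    KIRQL TranslatedIrql,
                                    PKINTERRUPT *InterruptObject)
    {
        return IoConnectInterrupt(
            InterruptObject,
            MyIsr,                      /* ServiceRoutine */
            DeviceExtension,            /* ServiceContext */
            NULL,                       /* I/O manager supplies the spin lock */
            TranslatedVector,
            TranslatedIrql,
            TranslatedIrql,             /* SynchronizeIrql */
            LevelSensitive,             /* or Latched, per the resources */
            TRUE,                       /* ShareVector */
            (KAFFINITY)0xFFFFFFFF,      /* the "just say all processors" guidance */
            FALSE);                     /* FloatingSave */
    }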

So… NOT so simple, really.

You know what? I miss Thomas Divine right now. Sigh.

Peter
OSR
@OSRDrivers

Well, the very idea of MSI and MSI-X is that a device can request an interrupt on a CPU as a memory write (namely, by writing the contents of the Message Data Register to the address contained in the Message Address Register). This, first, spares a driver the need to go to the device in order to find out whether it had actually interrupted, and, second, gives you a chance to avoid interrupt sharing. However, both the message destination and the actual contents of the message have to be defined by system-level software when the device is configured.

Therefore, the only thing that a device can do is request an interrupt, but all the actual decisions about how it gets processed (including interrupt affinity) have to be made in advance by the software. If you need more precise info, you can check Section 10.11 of the Intel Developer's Manual - it goes into great detail about the layouts of both of the above-mentioned registers.
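To make it concrete, here is roughly how the system software composes the two values it programs into the device. The field positions below are quoted from memory of the manual, so double-check them against the SDM before relying on them:

    #include <stdint.h>

    /* Message Address Register: 0xFEE in bits 31:20, Destination ID (APIC ID)
       in bits 19:12, Redirection Hint in bit 3, Destination Mode in bit 2. */
    static uint32_t msi_compose_address(uint8_t dest_apic_id)
    {
        return 0xFEE00000u
             | ((uint32_t)dest_apic_id << 12)
             | (0u << 3)     /* RH = 0: deliver to the CPU named above */
             | (0u << 2);    /* DM = 0: physical destination mode      */
    }

    /* Message Data Register: vector in bits 7:0, delivery mode in bits 10:8,
       trigger mode in bit 15 (0 = edge). */
    static uint32_t msi_compose_data(uint8_t vector)
    {
        return (uint32_t)vector
             | (0u << 8)     /* delivery mode: fixed */
             | (0u << 15);   /* edge triggered       */
    }

    /* The device then "requests an interrupt" simply by writing
       msi_compose_data(...) to the address msi_compose_address(...);
       it has no say in which CPU or vector those values name. */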

<comical mode>

BTW, do you remember a "funny" thread where our good old friend Alberto was trying to "use MSI" by making the CPU write to its own local APIC's ICR? I am not sure if it happens to be the same thread where Mr. Kyler provided "the only correct spinlock implementation in existence", which, in his opinion, is a tight polling loop of interlocked operations - the only thing that I remember for sure is that Alberto participated in that one as well. IIRC, on that thread he was claiming to have found a way to implement a spinlock on a system that does not support bus locking…

I think that you are speaking about the so-called serialised miniport drivers. Back in the days of NDIS 5.0 and 5.1 we were advised to write deserialised ones, but serialised ones were still supported. They were officially deprecated in NDIS 6.0.

> You know what? I miss Thomas Divine right now. Sigh.

Me too. Thomas was my favorite poster for sure, with profound knowledge of the ins and outs of all the internals of Windows networking, and always willing to share that knowledge…

Anton Bassov

You know, Anton, I *hate* it when you argue with me… especially when you don't really have any relevant experience to back up what you're arguing about.

Sigh. But the POINT is the DEVICE selects which message to generate. Like the NIC will signal different interrupts for transmit complete, receive complete, and “I need you to send me more buffers” – AND it can target those messages to particular processors.

Why are we arguing about this? Is it just so you can show that you’ve read the Intel manuals?

Nope. That’s emphatically *not* what I’m talking about.

Anton, have you ever written an NDIS driver for a NIC? If not, please just stop posting on this topic.

Thank heavens we agree on SOMEthing!

Peter
OSR
@OSRDrivers

> You know, Anton, I *hate* it when you argue with me…

Well, I figured that out a VERY long time ago…

> Is it just so you can show that you've read the Intel manuals?

Actually, the only reason I referred you to the Intel Manuals is to make you see the layout of the MSI-related registers. At that point you would realise that a device by itself simply does not, and cannot, have sufficient knowledge of the details of the host system required for generating an MSI. For example, the Destination ID field of the Message Address Register corresponds to certain bits of the IOAPIC Redirection Table Entry, and the Message Data Register contains, apart from other things, the interrupt vector number. Therefore, the target message has to be pre-configured by the system software.

In terms of software, the logic behind the whole thing can be described as

if (interrupt_needed) generate_interrupt(handle);

where 'handle' is an opaque handle provided by the caller of the imaginary register_interrupt_message() function, and refers to internal details that are completely opaque to the caller of generate_interrupt().

MSI-X extends this approach further, and allows a device to generate multiple pre-configured messages, which potentially allows the same device to have multiple ISRs that are serviced via different interrupt vectors. In software terms it can be described as

if (reason_a) generate_interrupt(handle_x);
else if (reason_b) generate_interrupt(handle_y);
/* etc. */

However, in both cases it is not the device that chooses the target CPU(s) and vector(s). The only thing it does is generate pre-configured messages. This is the only thing that I am saying. I really have no idea why it pisses you off…

> Anton, have you ever written an NDIS driver for a NIC?

A virtual one, but it was so long ago - it was definitely well before Vista arrived. Admittedly, it was not a production driver. The whole purpose of the exercise was an interactive disassembly of NDIS.SYS with SoftICE. I was working on a packet sniffer/filter, and "conventional" NDIS IM filters were rather problematic at the time.

Therefore, based upon the NDIS WDK samples, I replicated the entire stack (i.e. miniport-IM-protocol) for the sole purpose of seeing how network packets travel up and down the stack; how the supposedly opaque NDIS_HANDLEs that you pass to NDIS functions map to the undocumented structures in NDIS.H; and, in general, how the entire network stack gets bound together by the above-mentioned undocumented structures and how all these pieces fit into the same picture.

Certainly I did not have the audacity to ask Thomas for help -- after all, he was offering his NDIS hooking samples at the time, so asking him too many questions would have been pretty much the same thing as saying "I want to develop more or less the same 'not-so-conventional' product that you offer. Could you please share your 'forbidden knowledge' with me?"

Therefore, I was completely on my own, with SoftICE, the samples, and NDIS.H being the most informative sources I had. In order to figure out all the heads and tails, I rewrote the miniport driver in multiple versions and configurations just to see how those changes would affect its interactions with the upper-layer drivers.

In other words, I am not really an NDIS newbie, you know…

Sadly enough, all this “forbidden knowledge” was largely obsoleted by NDIS 6…

Nope. That’s emphatically *not* what I’m talking about.

Fair enough…

To be honest, MP-related issues were the last thing I worried about at the time. The thing is, as you must know, SoftICE had HUGE trouble running on MP systems. In my experience, it froze the system the very moment its window popped up if multiple cores/threads were enabled. Therefore, in order to do what I was doing without freezing the system, I had to disable MP in the BIOS settings every time I started a debug-disasm session.

Anton Bassov

Because you’re being pedantic and condescending by presuming to teach me how MSI-X works. And you’re acting like I’ve never seen an MSI-x sequence on a PCIe Bus Analyzer. Which is, you know, insulting and annoying.

And because you’re fixating on the details that you apparently know and love, but are not relevant to the discussion. To me, your argument seems pretty remedial because you’re basically arguing that, although the device can choose the interrupt and the target processor, it does so by using a semi-opaque value that has been provided to it.

To repeat:

> But the POINT is the DEVICE selects which message to generate.

We’re well beyond helping the OP. As a result, I’m locking this thread.

Peter
OSR
@OSRDrivers