Given Interrupt ALWAYS Arrives on the Same Processor

I wanted to correct something that I posted a few weeks back.

@Tim_Roberts wrote:

In general, all of the interrupts for a single device are routed to the same CPU

to which I replied (you can see it here):

Hmmmm… really? I’ve never heard this in all the years I’ve been writing drivers… and I’m not sure it matches my experience.

But this continued to bother me… because (a) Mr. Roberts is so rarely “wrong” on such things, and (b) I realized I hadn’t actually spent any serious time in years looking at which CPU a given interrupt arrives on.

So… Having a piece of hardware, and a driver under development, I decided to take a look.

And sure enough, Mr. Roberts was correct.

Not only is this clear from the WDFKD output, but I also added some simple code to my ISR to check which processor each interrupt arrived on.

And after several thousand interrupts… no DbgPrints.

So… THANK you to Mr. Roberts for setting me straight!

Peter

As an aside, this fact (that a given interrupt for a given device is always targeted to the same processor) actually distresses me mightily. And it’s not that easy to change the Affinity Policy for MSI/MSI-X type interrupts.

So… yeah. Now that I’m correctly informed, I’m unhappy.

Peter


Maybe, try a fancier machine, with NUMA…?

I would expect that interrupts on NUMA machines should be more localized than those on non-NUMA machines. And the same applies to larger machines that have processor groups. You would not want an ISR to run on a core that has to perform non-local memory access.

And from a system point of view, it is probably better to route all interrupts from a single device to a single core. Like all performance questions, this will surely vary dramatically with the exact hardware and workload. But presuming there are multiple devices in the system that are critical to that workload’s performance, it makes sense to configure them so that they minimally interfere with one another, and with CPU-bound work that might be scheduled on other cores.

I’m sure it’s more complex than that, but those are some initial thoughts.

@Pavel_A No difference… Did you see the interrupt policy in the wdflogdump?

You would not want an ISR to run on a core that has to perform non-local memory access

But you might want your ISRs spread across ALL the near processors.

it is probably better to route all interrupts from a single device to a single core

I don’t see it. But, you know, that’s not unusual…

Peter

Hi

Probably not directly related, but I had to specify a different affinity policy (5) to spread my 32 MSI-X messages across different logical processors (LPs): MsgID 0 gets assigned to one LP, and then each subsequent message is assigned round-robin. Without that, all 32 were assigned to two physical cores (LP0–LP3) on my client machine, which had some 12 to 16 LPs.

But yes, I think once a MsgID is routed to an LP, it stays there.

https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/interrupt-affinity-and-priority
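Per the linked docs, that policy can be set in the registry under the device’s hardware key via the INF. A sketch (the section name `MyDevice.HW.AddReg` is hypothetical; the value 5 corresponds to IrqPolicySpreadMessagesAcrossAllProcessors in the IRQ_DEVICE_POLICY enumeration):

```
[MyDevice.HW.AddReg]
; DevicePolicy = 5 (IrqPolicySpreadMessagesAcrossAllProcessors)
HKR, "Interrupt Management\Affinity Policy", DevicePolicy, 0x00010001, 5
```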

I was unable to control affinitizing the MSI-X messages to LPs programmatically.

Also, one thing I noticed is that the processor-ID info in the ETL logs was sometimes wrong (mostly in ISR/DPC prints), so I had to call KeGetCurrentProcessorNumberEx(NULL) myself and append the result to my strings to be sure which LP the ISR/DPC was on…

Thanks

In the case of an ‘almost always busy’ device, I think that keeping its almost-always-asserted interrupt local to one CPU would be the most efficient way to service that device. Otherwise you are going to be competing across CPUs for access to device resources, and potentially churning your CPU caches as well.

Yes, you might want ISR execution to be spread among ‘near’ processors. But ISR execution should be short and the time taken should be dominated by device access. Confining ‘busy’, ‘chatty’ or storming devices to a single processor is obviously better for overall system performance, but even when the device is functioning properly, how many truly independent interrupts do most devices have?