I’m trying to solve an optimization problem and keep running up against my shallow understanding of Windows internals.
We have an NDIS filter driver that MOSTLY just passes everything through, with the exception of certain packets of type GVSP (GigE Vision Stream Protocol). Those packets it intercepts, queues up, and assembles into a completed image frame for the user.
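For context, the receive side of the filter looks roughly like the sketch below. `IsGvspNetBufferList`, `QueueGvspFragment`, and `FILTER_CONTEXT` are placeholders for our internal routines and state; the real code also deals with NBL chains, resource-constrained indications, and so on.

```c
#include <ndis.h>

typedef struct _FILTER_CONTEXT {
    NDIS_HANDLE FilterHandle;    // the NdisFilterHandle we were given in FilterAttach
} FILTER_CONTEXT, *PFILTER_CONTEXT;

// Placeholders for our internal routines.
BOOLEAN IsGvspNetBufferList(PNET_BUFFER_LIST Nbl);                     // parse the UDP/GVSP header
VOID    QueueGvspFragment(PFILTER_CONTEXT Ctx, PNET_BUFFER_LIST Nbl);  // copy payload into the frame under assembly

VOID
FilterReceiveNetBufferLists(
    NDIS_HANDLE         FilterModuleContext,
    PNET_BUFFER_LIST    NetBufferLists,
    NDIS_PORT_NUMBER    PortNumber,
    ULONG               NumberOfNetBufferLists,
    ULONG               ReceiveFlags)
{
    PFILTER_CONTEXT ctx = (PFILTER_CONTEXT)FilterModuleContext;

    if (IsGvspNetBufferList(NetBufferLists))
    {
        // Consume the GVSP payload: copy it into the frame being assembled.
        QueueGvspFragment(ctx, NetBufferLists);

        // Since we copied, hand the NBLs straight back to the miniport instead of
        // indicating them up the stack. (If the indication is resource-constrained,
        // ownership stays with the caller and we simply return.)
        if (!NDIS_TEST_RECEIVE_CANNOT_PEND(ReceiveFlags))
        {
            ULONG returnFlags = NDIS_TEST_RECEIVE_AT_DISPATCH_LEVEL(ReceiveFlags)
                                    ? NDIS_RETURN_FLAGS_DISPATCH_LEVEL : 0;
            NdisFReturnNetBufferLists(ctx->FilterHandle, NetBufferLists, returnFlags);
        }
        return;
    }

    // Everything else is passed through untouched.
    NdisFIndicateReceiveNetBufferLists(ctx->FilterHandle, NetBufferLists,
                                       PortNumber, NumberOfNetBufferLists,
                                       ReceiveFlags);
}
```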
So the user application does its normal network reads/writes, but it ALSO issues an IOCTL to the driver, passing in a buffer big enough for a completed frame. That IOCTL blocks until the frame is complete, at which point control returns to the caller.
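On the user side that's nothing exotic, just a synchronous DeviceIoControl. The device name, IOCTL code, and frame size below are made-up placeholders, shown only to make the flow concrete:

```c
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

// Placeholder values: the real device path and IOCTL code are defined by our driver.
#define IOCTL_GVSP_GET_FRAME  CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_OUT_DIRECT, FILE_ANY_ACCESS)
#define FRAME_SIZE            (2048u * 2048u)   // example frame size in bytes

int main(void)
{
    HANDLE dev = CreateFileW(L"\\\\.\\GvspFilter", GENERIC_READ | GENERIC_WRITE,
                             0, NULL, OPEN_EXISTING, 0, NULL);
    if (dev == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    BYTE *frame = (BYTE *)malloc(FRAME_SIZE);
    DWORD bytes = 0;

    // Blocks in the driver until a complete frame has been assembled directly
    // into 'frame', then control returns here with the finished image.
    if (DeviceIoControl(dev, IOCTL_GVSP_GET_FRAME, NULL, 0,
                        frame, FRAME_SIZE, &bytes, NULL)) {
        printf("got frame: %lu bytes\n", bytes);
    } else {
        printf("DeviceIoControl failed: %lu\n", GetLastError());
    }

    free(frame);
    CloseHandle(dev);
    return 0;
}
```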
The point of this is to avoid the extra copy the user would otherwise need in order to assemble the packets into a completed frame; doing the assembly directly in the driver, into the caller's buffer, makes for one less copy.
Here’s where my problem arises. We have an application capturing and processing frames from a 10G NIC. One or more CPU cores are clearly much 'busier' than the others, and every now and then we get a dropped frame. If we adjust the affinity mask of the application to avoid the 'busy' cores, we no longer get any dropped frames.
My assumption is that when the Windows scheduler happens to put our threads on that already-busy core, the CPU usage on that core occasionally spikes just a bit too high and we drop a frame. That says to me that setting the affinity of the application is a reasonable thing to do.
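To be clear, 'adjusting the affinity mask of the application' is nothing more than something like the following at startup; the hard part is choosing the right bit to clear (core 0 below is purely an example):

```c
#include <windows.h>

// Hypothetical example: steer the whole process off one core, on the assumption
// that it is the 'busy' one. Picking the right core is exactly my problem.
static void AvoidBusyCore(int busyCore)
{
    DWORD_PTR processMask = 0, systemMask = 0;

    if (GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask))
    {
        DWORD_PTR newMask = systemMask & ~((DWORD_PTR)1 << busyCore);
        if (newMask != 0)   // never strip the mask down to nothing
        {
            SetProcessAffinityMask(GetCurrentProcess(), newMask);
        }
    }
}

// e.g. AvoidBusyCore(0); early in main()
```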
I’m trying to understand why there is this concentration of activity on particular cores. I’m also trying to find a deterministic way to know WHICH core is the 'busy' one so that I can set the affinity mask of the application properly. My initial guess is that it's the IRQ affinity of the NIC, since our 'driver' is just a filter and doesn't have any threads of its own. I was playing with the Interrupt Affinity Policy Tool to set the affinity for the NICs, but it doesn't seem to have any impact.
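The closest thing I have to a way of finding the busy core is empirical: sample per-core '% DPC Time' (and '% Interrupt Time') while the camera is streaming and see which processor stands out. A PDH sketch of what I mean, with the counter path and one-second interval just as an example:

```c
#include <windows.h>
#include <pdh.h>
#include <stdio.h>
#include <stdlib.h>
#pragma comment(lib, "pdh.lib")

// Sample "% DPC Time" for every logical processor over one second while the
// camera streams; the core servicing the NIC's receive DPCs should stand out.
int main(void)
{
    PDH_HQUERY   query = NULL;
    PDH_HCOUNTER counter = NULL;
    DWORD        bufSize = 0, itemCount = 0;

    PdhOpenQueryW(NULL, 0, &query);
    PdhAddEnglishCounterW(query, L"\\Processor(*)\\% DPC Time", 0, &counter);

    PdhCollectQueryData(query);
    Sleep(1000);                      // measure while frames are being captured
    PdhCollectQueryData(query);

    // First call reports the required buffer size, second call fills it.
    PdhGetFormattedCounterArrayW(counter, PDH_FMT_DOUBLE, &bufSize, &itemCount, NULL);
    PPDH_FMT_COUNTERVALUE_ITEM_W items = (PPDH_FMT_COUNTERVALUE_ITEM_W)malloc(bufSize);
    PdhGetFormattedCounterArrayW(counter, PDH_FMT_DOUBLE, &bufSize, &itemCount, items);

    for (DWORD i = 0; i < itemCount; i++)
    {
        wprintf(L"CPU %ls: %5.1f%% DPC time\n",
                items[i].szName, items[i].FmtValue.doubleValue);
    }

    free(items);
    PdhCloseQuery(query);
    return 0;
}
```

That tells me where the load is after the fact, but it's hardly the deterministic answer I'm after.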
My understanding is that a user call to a kernel driver is simply the same thread (and likely CPU core) continuing from user to kernel space. I also understand that IRQs are serviced by co-opting whichever thread happens to be running on the core at the time. I don’t entirely understand where packets travelling up the network stack fit into this.
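One thing I could at least do is instrument the filter's receive handler to see which processor the indications arrive on (purely diagnostic; both calls below are safe at DISPATCH_LEVEL, and I'm assuming a single processor group for simplicity):

```c
#include <ndis.h>

// Diagnostic only: tally which processor each receive indication arrives on.
// Called at the top of FilterReceiveNetBufferLists while the camera streams;
// the counters can then be dumped with DbgPrint or inspected in a debugger.
static volatile LONG g_RxIndicationsPerCpu[MAXIMUM_PROC_PER_GROUP];

VOID
TallyReceiveProcessor(VOID)
{
    PROCESSOR_NUMBER procNum;

    KeGetCurrentProcessorNumberEx(&procNum);
    InterlockedIncrement(&g_RxIndicationsPerCpu[procNum.Number]);
}
```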
Sorry for the long-winded post. I'm hoping someone can tell me whether I'm on the right track with IRQ affinity or whether there are other things to try. Beyond that kind of ad-hoc logging, I don't see any way within our driver to discover which core to avoid, or to control which core the NIC's interrupts are delivered to.