Increasing NDIS filter driver speed with deferred procedure calls on target processors

Hi everyone. I started working with NDIS lightweight filter drivers not long ago, and my task is to increase the speed of the filter receive path. My driver performs some long calculations on every packet in FilterReceiveNbls, and that work reduces the throughput of the network stack, especially with 10 Gbit/s adapters. I'm trying to increase the speed by using additional logical processors on a multi-core CPU with hyper-threading.

The main problem with RX processing is that it always runs on logical processor 0 (the miniport's interrupt fires and the DPC is queued from the ISR on processor 0). My approach is this: for every FilterReceiveNbls call I create a DPC whose target processor is different from the current one (0 by default) and set its importance to MediumHighImportance, so queuing it triggers the dispatcher to raise IRQL on that processor and start draining its DPC queue.

When I send the DPCs to a single processor I get pretty good bandwidth, but if I use all processors (for example, round-robin to 2, 4, 6, 8, 10, 2, 4, 6, etc.) the speed degrades, and I can't understand why. Maybe it's the increase in IPIs? Each processor has its own DPC queue, so I don't really get it … please help. I also observe that the cores are loaded evenly, yet the speed decreases at the same time.

Forgot to say: in the deferred call itself I perform the same operations as in the usual FilterReceiveNbls path, doing the long packet processing and indicating the packets up the stack.
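
Roughly, this is what I do per receive indication (a simplified sketch; FILTER_RX_WORK, RxDpcRoutine and QueueRxWorkOnProcessor are just my own names, and error handling / NBL ownership details are left out):

```c
#include <ndis.h>

typedef struct _FILTER_RX_WORK {
    KDPC             Dpc;
    PVOID            FilterContext;
    PNET_BUFFER_LIST Nbls;
} FILTER_RX_WORK;

// Defined elsewhere: does the long per-packet work, then calls
// NdisFIndicateReceiveNetBufferLists and frees the work item.
KDEFERRED_ROUTINE RxDpcRoutine;

VOID QueueRxWorkOnProcessor(PVOID FilterContext, PNET_BUFFER_LIST Nbls,
                            CCHAR TargetProcessor)
{
    FILTER_RX_WORK *work = ExAllocatePool2(POOL_FLAG_NON_PAGED,
                                           sizeof(*work), 'wRxF');
    if (work == NULL) {
        return;                                   // real code: process inline instead
    }

    work->FilterContext = FilterContext;
    work->Nbls          = Nbls;

    KeInitializeDpc(&work->Dpc, RxDpcRoutine, work);
    KeSetTargetProcessorDpc(&work->Dpc, TargetProcessor);   // run on that CPU
    KeSetImportanceDpc(&work->Dpc, MediumHighImportance);   // drain promptly
    KeInsertQueueDpc(&work->Dpc, NULL, NULL);
}
```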

Processor affinity is pretty tricky and best left to the OS to determine … it factors in memory location, physical adapter locations, etc. …

You're doing the processing in a DPC (be careful of the DPC watchdog; I would normally suggest using an APC-level worker thread, but you're time sensitive, so inside the DPC is probably the best idea), and I would suggest picking one processor to dispatch the DPCs to.

There's actually quite a bit of heavy lifting involved in moving context information from one CPU to another, and it totally messes up the lookahead logic, which is a big part of the optimization, so just pick one processor to be the DPC target and set that for the DPCs …


@craig_howard said:
Processor affinity is pretty tricky and best left to the OS to determine … it factors in memory location, physical adapter locations, etc. …

You're doing the processing in a DPC (be careful of the DPC watchdog; I would normally suggest using an APC-level worker thread, but you're time sensitive, so inside the DPC is probably the best idea), and I would suggest picking one processor to dispatch the DPCs to.

There's actually quite a bit of heavy lifting involved in moving context information from one CPU to another, and it totally messes up the lookahead logic, which is a big part of the optimization, so just pick one processor to be the DPC target and set that for the DPCs …

But if I'm going to use only one processor, then there will be no increase in speed. Is there any other way to achieve performance gains on multi-core processors?

It's not so much a speed increase as getting out of the way of the OS … the OS will attempt to use processor 0 for ISRs and unaffiliated DPCs (yes peanut gallery, I know that there are NUMA factors at work here), so do your processing on a processor that the OS is leaving alone (1, 2, 3, etc.) … by doing all of the work on one processor you can take advantage of code lookahead caching of your DPC itself …
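
Mechanically, targeting the DPC is a one-time setup call, something like this (just a sketch; processor 1 is only an example, and in a real driver you would check the machine's topology first):

```c
#include <ntddk.h>

// Pin a DPC to one specific logical processor that the OS mostly leaves alone.
VOID PinDpcToProcessorOne(PKDPC Dpc)
{
    PROCESSOR_NUMBER target;

    RtlZeroMemory(&target, sizeof(target));
    target.Group  = 0;
    target.Number = 1;     // example only: first non-zero logical processor

    // Return value should be checked in real code (invalid processor, etc.).
    KeSetTargetProcessorDpcEx(Dpc, &target);
}
```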

The one time that I tried to make up my own processor affinity scheme for handling InfiniBand traffic (and quickly came to realize that the OS is smarter than I am), I found it fastest to leave processor 0 alone and put my things with processor affinity on cores 1, 2 and 3 of a quad-core machine … note that physical and logical cores make a difference here; you want to stay on the same physical core, so for a dual-core machine you can use cores 0 and 1, on a quad 0, 1, 2 and 3, and so on …

As I understood it, you mean that if I have, for example, a physical processor with 4 cores and hyper-threading, then I have 8 logical processors and I should use only logical processors 0 and 1 (the first physical core)?

Correct … keep your affinity fiddling on a single physical CPU die which is then broken into logical cores. Here’s a bit of background [ https://unix.stackexchange.com/questions/88283/so-what-are-logical-cpu-cores-as-opposed-to-physical-cpu-cores ] as well as [ https://meribold.org/2017/10/20/survey-of-cpu-caches/ ]

What you're trying to do is keep as much of your program's code and data as possible on the same physical processor's cache lines. If you don't do this, the processors will be forced to copy that information back and forth (hence the large number of IPIs that you will see). Logical cores are actually not full "cores" in the strict sense of the word, so they share the same cache lines … so if we can localize all of the data processing onto a single physical core, we gain from the prefetch queues.

The OS is going to use core 0 by default (physical core 0), so we need to set our processor affinity to the same physical chip. Note that not all physical cores support four logical cores; it's important to check that in PrepareHardware (see other postings here for how to do that) and act accordingly …
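
If you want to check the topology from the driver (at PASSIVE_LEVEL, e.g. in FilterAttach for an LWF), a sketch like this asks the kernel which logical processors are hyper-thread siblings of the current core; the fixed 256-byte buffer is just for illustration:

```c
#include <ntddk.h>

VOID LogSiblingLogicalProcessors(VOID)
{
    PROCESSOR_NUMBER self;
    UCHAR buffer[256];
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info =
        (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer;
    ULONG length = sizeof(buffer);
    NTSTATUS status;

    KeGetCurrentProcessorNumberEx(&self);

    // RelationProcessorCore for one processor returns the set of logical
    // processors (hyper-threads) sharing that physical core.
    status = KeQueryLogicalProcessorRelationship(&self, RelationProcessorCore,
                                                 info, &length);
    if (NT_SUCCESS(status)) {
        KAFFINITY siblings = info->Processor.GroupMask[0].Mask;
        DbgPrint("CPU %u.%u shares a physical core with mask 0x%p\n",
                 (ULONG)self.Group, (ULONG)self.Number, (PVOID)siblings);
    }
}
```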

There is another important consideration: what's the nature of your 'long calculations'? If they can be made independent of one another on a per-packet basis, then you have the possibility of increasing your total aggregate throughput (not necessarily the latency of any one packet) by using multiple cores and even multiple hyper-threads within the same core. But if the calculation on one packet depends on the completion of a calculation from a previous packet, or is a prerequisite for a calculation on a future packet, then multiple cores will decrease performance for sure. In that case the calculation is still serial (one core or thread), but you add the overhead of signaling other CPUs to do the work and of collecting the results.

Usually, though, not every bit of information needs to be processed in exactly the order that it arrives in. And if you can figure out which ones don't, and how to group them together into chunks that do need to be processed in order, you can effectively leverage additional CPU cores to process multiple streams of data in parallel.

@MBond2 said:
There is another important consideration: what's the nature of your 'long calculations'? If they can be made independent of one another on a per-packet basis, then you have the possibility of increasing your total aggregate throughput (not necessarily the latency of any one packet) by using multiple cores and even multiple hyper-threads within the same core. But if the calculation on one packet depends on the completion of a calculation from a previous packet, or is a prerequisite for a calculation on a future packet, then multiple cores will decrease performance for sure. In that case the calculation is still serial (one core or thread), but you add the overhead of signaling other CPUs to do the work and of collecting the results.

Usually, though, not every bit of information needs to be processed in exactly the order that it arrives in. And if you can figure out which ones don't, and how to group them together into chunks that do need to be processed in order, you can effectively leverage additional CPU cores to process multiple streams of data in parallel.

Got you. Every operation on a packet doesn't depend on the previous or next packet, so each one can be processed independently. So how can I parallelize my work? Is it possible using DPCs?

@MBond2 is correct, parallelization can give you benefits but can also very easily slow things down to a crawl and cause all kinds of subtle bugs and performance issues. Some GoogleFu on “HPC” and “memory fences” and “MPI” is a good place to start, but in essence it’s the idea of having the work broken up and the solutions added together later.

An example of this would be JPG conversion of a pixel array. Suppose that my algorithm can encode 9x9 JPG blocks, and I'm operating on a picture that's 81x81. There are two ways to do this … first, I can simply traverse the picture, left to right and top to bottom, encoding as I go. Some sections (like blank spaces) encode very quickly, others (like moiré patterns) take much longer, and eventually I get to the end and am done. With a single thread doing this it will take N time, done.

Suppose now that I have four cores available, and I know that the 81x81 picture is composed of 81 of these 9x9 chunks. I start up four encoding threads and task thread one with encoding chunks 0-20, thread two with chunks 21-40, thread three with chunks 41-60 and thread four with chunks 61-80. Each thread will set a completion event once it's done with its encoding, and off they go. I have no idea which of the four threads will finish first, or last, only that once all four threads have signaled completion I have encoded my picture. Using four threads this will now take roughly N/4 time, done.

There are complications, of course … but this is the gist of it … the key to success is that each "chunk" is not dependent upon the other chunks. For your application, if these "long calculations" can be broken down into similar non-dependent portions then they may be candidates for parallelization … but again, they can't be dependent upon each other …
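
In user-mode terms, the fan-out/fan-in pattern looks something like this toy sketch (EncodeChunk stands in for the real 9x9 encode, and the chunk ranges match the example above):

```c
#include <windows.h>

#define NUM_THREADS 4

typedef struct { int First; int Last; } RANGE;

static void EncodeChunk(int chunk)      // placeholder for the real 9x9 JPG encode
{
    (void)chunk;
}

static DWORD WINAPI EncodeRange(LPVOID param)
{
    const RANGE *r = (const RANGE *)param;
    for (int i = r->First; i <= r->Last; ++i) {
        EncodeChunk(i);                 // chunks are independent, so no locking
    }
    return 0;
}

int main(void)
{
    HANDLE threads[NUM_THREADS];
    RANGE  ranges[NUM_THREADS] = { {0, 20}, {21, 40}, {41, 60}, {61, 80} };

    for (int t = 0; t < NUM_THREADS; ++t) {
        threads[t] = CreateThread(NULL, 0, EncodeRange, &ranges[t], 0, NULL);
    }

    // Completion order doesn't matter; the picture is done when all four
    // workers have finished their independent ranges.
    WaitForMultipleObjects(NUM_THREADS, threads, TRUE, INFINITE);

    for (int t = 0; t < NUM_THREADS; ++t) {
        CloseHandle(threads[t]);
    }
    return 0;
}
```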

So does your problem fit with this model? Can your calculations be broken up into independent parts?

If your calculation is truly 'long' then you almost certainly don't want to do the calculations in a DPC. It really depends on the nature of that calculation which way you should go.

@MBond2 said:
So does your problem fit with this model? Can your calculations be broken up into independent parts?

If your calculation is truly 'long' then you almost certainly don't want to do the calculations in a DPC. It really depends on the nature of that calculation which way you should go.

Sorry for the long answer. Yes, for now I'm trying to develop an algorithm that uses all physical processors for indicate processing. I'm testing the following model: I break the packet chain into 8-packet segments (a specific of the long operation) and send each segment to a different queue; each queue is associated with a specific processor, and I queue a DPC on that processor (if it isn't already queued). When I've processed a segment on a processor, I check whether the next segment is in the queue; if it is present, processing continues until the queue is empty or the processing limit for one DPC is reached. When the queue becomes empty I cancel that DPC; otherwise, if the limit is reached, the DPC is requeued. When I send all segments to one processor, it works well. But if I use all accessible processors, the speed decreases, EVEN WITHOUT EXECUTING THE LONG OPERATION per packet. As far as I can see, the time interval between processing two segments on one processor increases, so the queue becomes empty much more often and the exit condition described above is triggered. I tried increasing the number of attempts to get an item from the queue, but it doesn't help. Any suggestions?
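
Roughly, my per-processor queue and its DPC look like this (a simplified sketch; CPU_QUEUE, ProcessSegment and the limit are just my own names, and cleanup/cancellation details are trimmed):

```c
#include <ntddk.h>

#define MAX_SEGMENTS_PER_DPC 16          // per-run processing limit

typedef struct _CPU_QUEUE {
    KSPIN_LOCK Lock;
    LIST_ENTRY Segments;                 // 8-packet segments waiting on this CPU
    KDPC       Dpc;                      // targeted at this queue's processor
    BOOLEAN    DpcQueued;
} CPU_QUEUE;

VOID ProcessSegment(PLIST_ENTRY Segment); // long work + indicate up the stack

// Enqueue side: called from FilterReceiveNetBufferLists after splitting the
// NBL chain into segments.
VOID QueueSegment(CPU_QUEUE *q, PLIST_ENTRY Segment)
{
    KIRQL irql;
    BOOLEAN startDpc = FALSE;

    KeAcquireSpinLock(&q->Lock, &irql);
    InsertTailList(&q->Segments, Segment);
    if (!q->DpcQueued) {
        q->DpcQueued = TRUE;
        startDpc = TRUE;
    }
    KeReleaseSpinLock(&q->Lock, irql);

    if (startDpc) {
        KeInsertQueueDpc(&q->Dpc, NULL, NULL);
    }
}

// The DPC drains its own queue up to the limit, then either finishes (queue
// empty) or requeues itself (limit reached while work is still pending).
VOID SegmentDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    CPU_QUEUE *q = (CPU_QUEUE *)Context;
    ULONG processed = 0;

    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);

    for (;;) {
        PLIST_ENTRY segment;

        KeAcquireSpinLockAtDpcLevel(&q->Lock);
        if (IsListEmpty(&q->Segments)) {
            q->DpcQueued = FALSE;                  // drained: let this DPC end
            KeReleaseSpinLockFromDpcLevel(&q->Lock);
            break;
        }
        if (processed >= MAX_SEGMENTS_PER_DPC) {
            KeInsertQueueDpc(&q->Dpc, NULL, NULL); // limit hit: run again later
            KeReleaseSpinLockFromDpcLevel(&q->Lock);
            break;
        }
        segment = RemoveHeadList(&q->Segments);
        KeReleaseSpinLockFromDpcLevel(&q->Lock);

        ProcessSegment(segment);
        processed++;
    }
}
```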

@MBond2 said:

So does your problem fit with this model? Can your calculations be broken up into independent parts?

If your calculation is truly 'long' then you almost certainly don't want to do the calculations in a DPC. It really depends on the nature of that calculation which way you should go.

Also, yes, my calculations can be broken up into segments of 8 packets each, and they don't depend on each other. You said "you can effectively leverage additional CPU cores to process multiple streams of data in parallel"; how can I do that from an NDIS filter driver?

You're doing a lot of DPC shuffling and such … every time you queue up a DPC, cancel a DPC, etc., you're burning up thousands and thousands of CPU cycles (and probably thrashing the cache lines, slowing things down even more). You didn't say how many segments you are processing at one time; I suggested you keep everything on one physical die (up to 8 logical cores on most CPUs), so that's eight segments and eight DPCs.
You should have one and only one DPC triggered for each 8-packet segment, and when that DPC completes it updates its part of the solution and that's it, done. Once all DPCs are complete and the solution is complete, then you can use it. There shouldn't be any DPC cancels or requeues …
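
As a sketch of what I mean (BATCH and ProcessOneSegment are made-up names; the event is just one way to notice that the whole batch is finished):

```c
#include <ntddk.h>

// One of these per batch: Outstanding is set to the number of segments and
// Done is initialized with KeInitializeEvent before the DPCs are queued.
typedef struct _BATCH {
    volatile LONG Outstanding;           // segment DPCs still running
    KEVENT        Done;                  // signaled by the last DPC to finish
} BATCH;

VOID ProcessOneSegment(PVOID Segment);   // the independent per-segment work

VOID BatchSegmentDpc(PKDPC Dpc, PVOID Context, PVOID Segment, PVOID Arg2)
{
    BATCH *batch = (BATCH *)Context;

    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Arg2);

    ProcessOneSegment(Segment);          // no cancels, no requeues: run once, done

    if (InterlockedDecrement(&batch->Outstanding) == 0) {
        // Last one out: the whole solution is now complete and usable.
        KeSetEvent(&batch->Done, IO_NO_INCREMENT, FALSE);
    }
}
```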

That advice seems odd. One physical die is not a normal unit to constrain an algorithm to; usually it's a processor, a core, a NUMA node or a processor group. NUMA nodes often align with physical dies, but not necessarily. On older systems there may be 8 or more dies per NUMA node, and on newer systems there may be something like 40 cores per die.

Anyway, it seems clear that you have far too many DPCs and that what I'll call the 'signalization' between threads dominates your performance issue. It will probably work better once you do that 'long' calculation, because it will set the CPUs to doing useful and independent work.

But you can’t expect good overall performance if you set all of the CPUs to do anything at a high IRQL for any amount of time. They won’t be able to do anything else - including consuming the results of your super special calculations

The "physical die" constraint was to keep data as localized as possible between cores: L1 certainly, L2 if possible and L3 if you're really lucky. The chips I worked with (apparently when stone was still dirt) had eight logical cores per physical die; I had no idea you could get 40 logical cores on a die now … wow!

A quick search shows one example

https://www.intel.ca/content/www/ca/en/products/sku/212287/intel-xeon-platinum-8380-processor-60m-cache-2-30-ghz/specifications.html

On the device side, locality to a NUMA node is certainly important. But device memory is usually not cached, so cache locality doesn't seem to matter so much. I would imagine it would be more about memory locality on the host side - whatever you are doing with the data after your ISR / DPC runs.


@MBond2 said:
That advice seems odd. One physical die is not a normal unit to constrain an algorithm to; usually it's a processor, a core, a NUMA node or a processor group. NUMA nodes often align with physical dies, but not necessarily. On older systems there may be 8 or more dies per NUMA node, and on newer systems there may be something like 40 cores per die.

Anyway, it seems clear that you have far too many DPCs and that what I'll call the 'signalization' between threads dominates your performance issue. It will probably work better once you do that 'long' calculation, because it will set the CPUs to doing useful and independent work.

But you can’t expect good overall performance if you set all of the CPUs to do anything at a high IRQL for any amount of time. They won’t be able to do anything else - including consuming the results of your super special calculations

For now I decided to use the following algorithm: there's a queue on each processor; I send 8-packet segments to one processor, and when its queue is completely full (~14 elements), I start using the next processor. By the way, I am not queuing the DPC from the default processor onto the target processor each time; as long as the queue stays full, the DPC requeues itself at the end of its run (when the maximum number of segments has been processed). If the queue is empty, the DPC does not requeue itself. Using this algorithm I do not get a drop in speed without the long operation. But when the long operation executes, there's still no increase in speed… Yet in Task Manager I can see the cores are loaded evenly… Still looking for a solution…
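
The processor-selection part is basically this (a sketch; QueueDepth is assumed to be maintained elsewhere with InterlockedIncrement/InterlockedDecrement as segments are queued and drained):

```c
#include <ntddk.h>

#define QUEUE_FULL_THRESHOLD 14          // ~14 segments before spilling over

// Keep filling the current CPU's queue; once it's full, move to the next CPU.
ULONG PickTargetCpu(volatile LONG *QueueDepth, ULONG CpuCount, ULONG Current)
{
    if (QueueDepth[Current] >= QUEUE_FULL_THRESHOLD) {
        Current = (Current + 1) % CpuCount;
    }
    return Current;
}
```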

Are you saying that your calculations on some packets are ‘short’ and others are ‘long’? Or that when you implement this scheme without doing your packet analysis calculations, the operations are ‘short’, but when you do your packet analysis calculations the operations are ‘long’?

In either case, you likely suffer from the bandwidth-latency problem, which is well documented. But the solutions for specific problems depend on how that latency impacts the higher-level protocols that your code will affect. Stream-based protocols like those based on TCP dominate, and per-stream (octuple) based approaches are typical. But recently UDP has been rediscovered - as in the QUIC protocol for HTTP/3. It turns out that the same approaches for TCP actually work for this too, except that the streams are ephemeral.

The details can be hard and difficult to explain.
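
To make the per-stream idea a little more concrete: the dispatch decision usually boils down to hashing the stream's address/port tuple so that every packet of one stream lands on the same queue (and therefore the same CPU, in order), while different streams spread out. A toy sketch, not real RSS, and the hash is arbitrary:

```c
#include <ntddk.h>

// Map a TCP/UDP flow to one of QueueCount per-processor queues. Packets of
// the same flow always hash to the same queue, so per-stream ordering holds.
ULONG PickQueueForStream(ULONG SrcIp, ULONG DstIp,
                         USHORT SrcPort, USHORT DstPort,
                         ULONG QueueCount)
{
    ULONG hash = SrcIp ^ DstIp ^ ((ULONG)SrcPort << 16) ^ DstPort;

    hash ^= hash >> 16;                  // cheap bit mixing
    hash *= 0x45D9F3BUL;

    return hash % QueueCount;
}
```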

@MBond2 said:
Are you saying that your calculations on some packets are 'short' and others are 'long'? Or that when you implement this scheme without doing your packet analysis calculations, the operations are 'short', but when you do your packet analysis calculations the operations are 'long'?

In either case, you likely suffer from the bandwidth-latency problem, which is well documented. But the solutions for specific problems depend on how that latency impacts the higher-level protocols that your code will affect. Stream-based protocols like those based on TCP dominate, and per-stream (octuple) based approaches are typical. But recently UDP has been rediscovered - as in the QUIC protocol for HTTP/3. It turns out that the same approaches for TCP actually work for this too, except that the streams are ephemeral.

The details can be hard and difficult to explain.

Actually, you are right. I found out that my algorithm works well for UDP traffic but does not increase speed for TCP packets (testing with iperf3). I see that when I send UDP traffic all my cores are loaded at 100%, but when it's TCP only one processor is loaded while the others are idle; it feels like one of the processors is waiting for the others.