Does forced affinity increase performance

On a system with several processors, can affinity increase the performance of a single thread?
According to Microsoft documentation, on a multiprocessor (MP) system the OS tries to give threads some affinity automatically.
That is, it will try to run a given thread on the same processor whenever possible, in order to benefit from local caches.

However, in practice I have observed that this ‘automatic’ affinity works poorly, if at all.
For example, a process with a single thread that is taking, say, 90% of the CPU time will run faster if that thread is given manual affinity.
If I let it run with the defaults, the thread bounces between the available processors, wasting the local caches.
By local cache in this context I mean the ‘virtual memory cache’, not the physical one, so the cached state is wasted each time a context switch occurs.

Is there any manual setting that can be used to fine-tune the ‘automatic affinity’ of threads?

Inaki.

One of the well-known recommendations for thread-pooling servers is to hard-attach the thread to some CPU to avoid cache thrashing.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

----- Original Message -----
From: Iñaki Castillo
To: Windows System Software Devs Interest List
Sent: Wednesday, January 05, 2005 10:07 PM
Subject: [ntdev] Does forced affinity increase performance



Questions? First check the Kernel Driver FAQ at http://www.osronline.com/article.cfm?id=256

You are currently subscribed to ntdev.
To unsubscribe send a blank email to xxxxx@lists.osr.com

Disclaimer: I know very little about how NT works internally. I do, however,
have some experience in trying to optimise code for dual processors, and
quite a bit of experience in how other OSes work in this respect.

This is one of those “balance” questions. If you have two idle processors
and start one thread that uses 90% of the CPU time, you would want that thread
to run on a single CPU. However, because of the background work that happens
even on an idle system, sometimes both CPUs are busy doing some menial
task: counting up the time, swapping pages in and out, receiving
network packets that are broadcasts but uninteresting to us, etc.

So, when both CPUs are busy with other work, your thread is put on the
“runnable-but-waiting” list, and the first CPU to become available will
take this thread on as a runnable thread. For many purposes, this is the
best solution.

As you mention, there are, however, some situations where this is quite
wasteful: we have just loaded the cache (either the CPU cache or some OS
per-processor cache) with useful data, and if we swap to the other
processor, the cache there contains nothing useful for this thread.

I’m not aware of any way to change the automatic affinity.

However, there is a drawback to setting the affinity mask manually: if
someone else has bound a higher-priority thread to the same processor,
you will be waiting for that thread to finish while the other processor
sits idle.

I think the best solution to the problem is to stay with the existing
scheme of letting the system decide which CPU to run on, since on average
this does better than the worst-case scenario.

If you still think you can outwit the system, you could try something like
this:

  • Create a thread for each (active) processor in the system, set affinity
    to a single processor, using a different processor for each thread.
  • Create a queue of “chunks of work”.
  • Use some form of event/semaphore to signal “new work arrived”.
  • Let all threads wait for the event.
  • When a thread wakes up, let it grab the first item in the queue. When
    finished, start waiting again.

The above method may not give optimal performance for every work item (due
to other threads running at higher priority on a particular processor), but
it should reduce the amount of unnecessary cache reloading.


Mats

xxxxx@lists.osr.com wrote on 01/05/2005 07:07:12 PM:


> If you still think you can outwit the system, you could try something
> like this:
>
> • Create a thread for each (active) processor in the system, set affinity
>   to a single processor, using a different processor for each thread.
> • Create a queue of “chunks of work”.
> • Use some form of event/semaphore to signal “new work arrived”.
> • Let all threads wait for the event.
> • When a thread wakes up, let it grab the first item in the queue. When
>   finished, start waiting again.

This is what I/O completion ports basically do, right?
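Essentially, yes. A completion port hands queued packets to a small pool of waiting threads, and the NumberOfConcurrentThreads parameter (0 means one per CPU) limits how many run at once. A minimal Windows-only sketch, using posted completion keys as work items and a hypothetical key of 0 as a shutdown sentinel:

```c
#include <windows.h>
#include <stdio.h>

static DWORD WINAPI worker(LPVOID arg)
{
    HANDLE port = (HANDLE)arg;
    DWORD bytes;
    ULONG_PTR key;
    LPOVERLAPPED ov;
    /* Block until a completion packet (real I/O or posted) arrives. */
    while (GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE)) {
        if (key == 0)                  /* sentinel: shut this worker down */
            break;
        printf("worker %lu got item %lu\n",
               GetCurrentThreadId(), (unsigned long)key);
    }
    return 0;
}

int main(void)
{
    /* NumberOfConcurrentThreads = 0 lets the kernel pick one per CPU. */
    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    HANDLE t[2];
    for (int i = 0; i < 2; i++)
        t[i] = CreateThread(NULL, 0, worker, port, 0, NULL);

    for (ULONG_PTR item = 1; item <= 4; item++)    /* post some "work" */
        PostQueuedCompletionStatus(port, 0, item, NULL);
    for (int i = 0; i < 2; i++)                    /* one sentinel per worker */
        PostQueuedCompletionStatus(port, 0, 0, NULL);

    WaitForMultipleObjects(2, t, TRUE, INFINITE);
    CloseHandle(port);
    return 0;
}
```

One relevant detail: the port wakes waiting threads in LIFO order, so a recently active thread (with a warm cache) tends to be reused before a cold one, which is the same cache argument made above without any manual affinity.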

Chuck