SetThreadAffinityMask and IO

If an application thread (user mode) invokes SetThreadAffinityMask, and subsequently makes calls to DeviceIoControl on a driver (kernel mode) which has no specified thread affinity, is the thread guaranteed to maintain its processor assignment?

The application I have in mind is processing (in parallel) data from multiple identical, simultaneously-sampling devices. Each device is handled by a separate thread. Ideally, the user interface thread would be assigned to yet another processor.

Here’s the problem I found in trying to answer this question. Most (maybe all?) of the literature on parallel processing concentrates on the processing, and ignores I/O. I suspect that I/O behaves differently, since it involves physical connections to the processor(s).

I tried a simple-minded experiment to see if there would be any improvement in the maximum data throughput using thread affinities, as opposed to just letting the OS schedule everything. In the experiment (Intel Core2 Quad CPU, WinXT, 2 acquisition devices/threads, 1 GUI thread) the devices are USB digitizers (our company product and driver) acquiring bursty data (~30 Mbits/s, in bursts of ~1ms on, ~1ms off). But the task manager is telling me that my thread affinity assignments are being ignored, so the experimental result is meaningless.

Thanks in advance for any constructive advice. I’m just getting started on this project and don’t want to waste too much time on a wild goose chase.

Hmmmm… I’m not sure what you’re asking.

If you’re asking “Is the requesting user-mode thread guaranteed to maintain its processor assignment” then the answer is certainly yes. You set an affinity, you get that affinity. Making a system service call (except one that would request a change in affinity) won’t change that.

If you’re asking “will the driver’s processing of the I/O operation occur on the same processor as that to which the thread is affinitized” the answer is “it depends”… but it is almost certainly no.

Does your device generate interrupts? What’s the I/O Model? What type of device is it?

Peter
OSR

  1. When a thread returns from a syscall, it keeps the same affinity (unless it’s a request to change the affinity).
  2. While in the kernel during a syscall, a driver may set a different affinity for a thread, but few if any drivers really care.
  3. Actual processing and completion of an IRP may happen on different processors, depending on interrupt handling and other considerations. You don’t have any control over that.
  4. Setting a thread affinity is unlikely to improve response time. It will only improve cache locality for your data.
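Points 1 and 2 can be checked empirically rather than inferred from Task Manager. The following is a minimal sketch, not a definitive test: `mask_for_core`, `current_thread_affinity` and `affinity_survives_ioctl` are hypothetical helper names, and the device handle and IOCTL code are whatever your own driver defines. It relies on the documented behavior that SetThreadAffinityMask returns the thread's previous mask. The Windows-only part is guarded so the mask helper also compiles elsewhere.

```c
#include <assert.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Affinity mask with only the given logical processor set. */
uintptr_t mask_for_core(unsigned core)
{
    assert(core < sizeof(uintptr_t) * 8);
    return (uintptr_t)1 << core;
}

#ifdef _WIN32
/* Read the calling thread's affinity mask. SetThreadAffinityMask returns
   the previous mask, so set a valid mask and then restore the old one.
   Side effect: the thread is briefly allowed on every processor of the
   process. Returns 0 on failure. */
DWORD_PTR current_thread_affinity(void)
{
    HANDLE self = GetCurrentThread();
    DWORD_PTR process_mask, system_mask;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask))
        return 0;
    DWORD_PTR previous = SetThreadAffinityMask(self, process_mask);
    if (previous != 0)
        SetThreadAffinityMask(self, previous);
    return previous;
}

/* Pin the calling thread to one core, issue a synchronous IOCTL, and report
   whether the thread's mask is unchanged afterwards. ioctl_code is an
   assumption: substitute a harmless request your driver supports. */
int affinity_survives_ioctl(HANDLE device, DWORD ioctl_code, unsigned core)
{
    DWORD_PTR wanted = (DWORD_PTR)mask_for_core(core);
    if (SetThreadAffinityMask(GetCurrentThread(), wanted) == 0)
        return 0;
    DWORD returned = 0;
    /* The result of the request itself does not matter for this check. */
    DeviceIoControl(device, ioctl_code, NULL, 0, NULL, 0, &returned, NULL);
    return current_thread_affinity() == wanted;
}
#endif
```

This only confirms the thread's own mask. It says nothing about where the ISR, DPC, or IRP completion ran, which is the subject of point 3.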

Given that your device is USB, you likely have absolutely no control over
where and how IRPs are completed, but for further background you should look
at MSI-X and targeted interrupts.

Note that on modern hardware, the only time you care about this is when
using NUMA systems; and even then, they need to be ‘large’ before the
effects of non-node local IO affect typical workloads. IMHO Windows does
not currently provide sufficient NUMA support to do much about this anyway,
so you are best off leaving it all up to the scheduler.

> If an application thread (user mode) invokes SetThreadAffinityMask, and
> subsequently makes calls to DeviceIoControl on a driver (kernel mode)
> which has no specified thread affinity, is the thread guaranteed to
> maintain its processor assignment?

A driver has no thread affinity because in general this is a meaningless
concept for a driver. The top-level dispatch routine will be executed in
the context of the calling thread, and as such is constrained by whatever
that thread’s affinity is set to. The ISR executes on whatever core the
interrupt is directed to, and this can vary with the basic architecture,
the support chip choice and/or the motherboard manufacturer. In any case
it is not under your control. The DPC executes, typically, on the same
core that the interrupt was processed on. You can direct the DPC to
another core, but then you must ask “which one”, which means you have to
know the thread affinity of the thread that issued the IRP.

This is all sounding like a need to find a solution to a nonexistent problem.

> The application I have in mind is processing (in parallel) data from
> multiple identical, simultaneously-sampling devices. Each device is
> handled by a separate thread. Ideally, the user interface thread would be
> assigned to yet another processor.

I used this technique for one client: if they didn’t run their video
threads at priority 15, they lost data. But if they ran the background
threads at 15, the GUI was essentially dead. So I asked them if three
cores could handle the data acquisition. They thought yes, so we built a
version of their app where we bound the GUI to affinity mask 0x00000001
and the secondary threads to (processor_mask & ~0x00000001). Worked like
a charm. Key here is letting the scheduler handle all the other cores.
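The split described above could be sketched roughly as follows. This is an illustration under assumptions, not the client's code: `GUI_MASK`, `worker_mask` and `bind_current_thread` are names made up for the sketch, and logical processor 0 is assumed to be the one reserved for the GUI. The Windows calls are guarded so the mask arithmetic also compiles on other platforms.

```c
#include <assert.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Assumption for this sketch: the GUI thread owns logical processor 0. */
#define GUI_MASK ((uintptr_t)0x1)

/* Every processor the process may use, minus those reserved for the GUI. */
uintptr_t worker_mask(uintptr_t process_mask, uintptr_t gui_mask)
{
    uintptr_t mask = process_mask & ~gui_mask;
    /* SetThreadAffinityMask rejects an empty mask, so fail early. */
    assert(mask != 0);
    return mask;
}

#ifdef _WIN32
/* Bind the calling thread either to the GUI core or to the shared worker
   set. Workers are not pinned to individual cores: the scheduler still
   places them anywhere inside the worker set. Returns nonzero on success. */
int bind_current_thread(int is_gui)
{
    DWORD_PTR process_mask, system_mask;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask))
        return 0;
    DWORD_PTR mask = is_gui
        ? (DWORD_PTR)GUI_MASK
        : (DWORD_PTR)worker_mask((uintptr_t)process_mask, GUI_MASK);
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}
#endif
```

On a four-core machine with a process mask of 0xF, the worker threads would end up with 0xE, that is, cores 1 through 3.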

Why do you need a separate thread to handle each device? This sounds
really strange. A better solution is to use async I/O with an I/O
Completion Port and a thread pool to handle the completions. For a device
like this, using synchronous I/O results in needless overhead and
complexity.
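As a rough sketch of that shape (assumptions: `device_ctx`, `on_transfer_complete`, `attach_device` and `completion_worker` are hypothetical names, and the device handles are opened with FILE_FLAG_OVERLAPPED), each device is associated with one completion port and a small pool of identical threads drains it. The port itself would be created with `CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0)`, where the final 0 lets the system cap concurrently running threads at the number of processors.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical per-device record; its address is used as the completion key. */
typedef struct device_ctx {
    int id;
    unsigned long long bytes_received;
} device_ctx;

/* Portable part: account for one completed transfer on a device. */
void on_transfer_complete(device_ctx *dev, unsigned long bytes)
{
    assert(dev != NULL);
    dev->bytes_received += bytes;
}

#ifdef _WIN32
/* Associate an overlapped device handle with an existing completion port. */
int attach_device(HANDLE port, HANDLE device, device_ctx *dev)
{
    return CreateIoCompletionPort(device, port, (ULONG_PTR)dev, 0) != NULL;
}

/* Thread procedure for each pool thread; param is the port handle.
   Any pool thread may service any device. */
DWORD WINAPI completion_worker(LPVOID param)
{
    HANDLE port = (HANDLE)param;
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED *ov = NULL;
        BOOL ok = GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE);
        if (ov == NULL)
            break;  /* wait failed, or a shutdown packet was posted with
                       PostQueuedCompletionStatus(port, 0, 0, NULL) */
        if (ok)
            on_transfer_complete((device_ctx *)key, bytes);
        /* A real worker would process the buffer and reissue the request
           (for example DeviceIoControl with the same OVERLAPPED) here. */
    }
    return 0;
}
#endif
```

With this arrangement the number of threads is independent of the number of devices, so adding a third or fourth digitizer does not add threads.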

> Here’s the problem I found in trying to answer this question. Most (maybe
> all?) of the literature on parallel processing concentrates on the
> processing, and ignores I/O. I suspect that I/O behaves differently,
> since it involves physical connections to the processor(s).

Look at database literature. It is almost always I/O bottlenecked, so
they want a lot of high-bandwidth concurrent I/O but parallel processing
in the classic sense rarely helps performance. You were searching for the
wrong topic. And yes, to take advantage of concurrent I/O you need to
know the device-to-memory-bus mappings (Intel, for example, supports
multiple busses to memory, while AMD supports NUMA, a topic far too large
to go into here).

But be aware: if you optimize your work on one particular architecture, it
may be pessimized for another. Even if you say “But we’re selling turnkey
systems”, two years from now when you order your next 100 machines, the
vendor may have changed them, and the board architecture you so carefully
optimized for is simply no longer available.

> I tried a simple-minded experiment to see if there would be any
> improvement in the maximum data throughput using thread affinities, as
> opposed to just letting the OS schedule everything. In the experiment
> (Intel Core2 Quad CPU, WinXT, 2 acquisition devices/threads, 1 GUI thread)
> the devices are USB digitizers (our company product and driver) acquiring
> bursty data (~30 Mbits/s, in bursts of ~1ms on, ~1ms off). But the task
> manager is telling me that my thread affinity assignments are being
> ignored, so the experimental result is meaningless.

It tells you this how? I’ve got demo code that shows that affinities are
obeyed. Well, at least they were on XP. Which OS are you using? Or did
you say “WinXT” when you meant to type “WinXP”?

> Thanks in advance for any constructive advice. I’m just getting started on
> this project and don’t want to waste too much time on a wild goose chase.


NTDEV is sponsored by OSR

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

Setting a highly-restrictive thread affinity (e.g., to a single core) will
almost certainly produce lower effective throughput than letting the
scheduler manage the threads.

Dedicating one thread per device, instead of using a thread pool to
process completions, will almost certainly reduce performance if (# of
devices) > (# of cores).

Having a thread pool with more threads than cores MIGHT result in
decreased performance, unless the threads can block (e.g., doing disk I/O,
etc.). In the case where threads can block, the limit on active threads
for the IOCP can be set to a larger number than the number of cores. The
thread pool on an IOCP can have more threads than cores, and the IOCP
deals with this if one of the threads has blocked.

Dividing the world into affinity classes MIGHT improve throughput, but
only if you get the RIGHT classes. A “class” would typically contain more
than one core. This requires fairly intimate understanding of device
affinities, NUMA affinities, etc. (For some programs that illustrate this,
check www.flounder.com/mvp_tips.htm and search for “NUMA”)

Creating the simple case where the GUI lives in a one-core class and all
work is done in other cores guarantees a responsive GUI, and is the one
exception to the idea of mapping a thread to a unique core.

Until it is demonstrated that there is an actual problem, I would urge you
to not worry too much about solving it.
joe

Thanks to everyone for their thoughts and suggestions. I’ll look into thread pools, and database I/O. I particularly like Joseph’s thought and will try it:

> where we bound the GUI to affinity mask 0x00000001
> and the secondary threads to (processor_mask & ~0x00000001). Worked like
> a charm. Key here is letting the scheduler handle all the other cores.

A couple of clarifications to the original post:

> Or did you say “WinXT” when you meant to type “WinXP”?

Ouch! WinXP it is.

In the experiment, there are three threads (GUI and 2 workers) and 4 processor cores. Each thread is assigned to a different core, while the 4th core is not used. But Task Manager shows significant activity on the 4th core (sometimes more than the other three put together!). Most of the activity on all the cores is kernel-based (red line).

As for the “actual problem”, there is an upper limit to the data throughput we can reliably obtain using these devices. We wanted to see (empirically) if we could raise the limit.

If most activity is in kernel mode, then it doesn’t make sense to affinitize your user mode threads. In fact, having the threads on a single core may be beneficial, since it can reduce spinlock contention and cache overhead.

And you are absolutely certain, beyond any shadow of a doubt, that there
are no other processes or kernel threads running? Consider the case I
cited. Suppose they ran the video process only on cores 1 and 2. This
leaves cores 0 and 3 available to run everything else in the system,
including most system service threads (priority 15 won’t inhibit kernel
threads, which typically run at priorities 16…31, from running on cores
1 and 2). When you say you are “Seeing activity”, that is what you are
SUPPOSED to see in a multicore scheduler. What you didn’t say was “I see
activity on core 3 which is running some of my threads whose affinity does
not include core 3”. But now you say that all you saw was “activity”.
Why should this surprise you? On a lightly-loaded machine, one with no
apps running, I remember seeing close to 100 active threads.

Oh, yes, and what core was running Task Manager?
joe
