Differences between UP and MP systems?

I was looking at upgrading my home PC and was looking at the various
HyperThreading information. What surprised me is that a number of
benchmarks showed a degradation on systems where HT is enabled versus it
being disabled, up to 5% when it’s mentioned by the tester. The apparent
cause of this is the OS differences between the Uniprocessor (ntoskrnl.exe)
and Multiprocessor (ntkrnlmp.exe) kernels.

As far as I know, the only differences between UP and MP are in the
ntoskrnl/ntkrnlmp.exe kernel; I don’t think that any other files such as the
HAL need to change. The only difference that I could think of was with
SpinLocks, but I don’t see that accounting for a 5% difference. There’s
surely some difference in the task scheduler, but I wouldn’t think that’s
major either. I suspect that there are other differences, but I couldn’t
find any with my Google searching, so could someone educate me/us?

Thanks in advance!

> a number of
> benchmarks showed a degradation on systems where HT is enabled versus it
> being disabled, up to 5% when it’s mentioned by the tester. The apparent
> cause of this is the OS differences between the Uniprocessor (ntoskrnl.exe)
> and Multiprocessor (ntkrnlmp.exe) kernels.

I don’t believe this, for the simple reason that if you use the MP kernel
on a true UP machine, the effect is much smaller.

> As far as I know, the only differences between UP and MP are in the
> ntoskrnl/ntkrnlmp.exe kernel; I don’t think that any other files such as
> the HAL need to change.

The HAL does change, and a few others.

> The only difference that I could think of was with
> SpinLocks,

That’s pretty much it. In some places where spinlocks aren’t needed due
to low-level use of a LOCKed instruction, the MP version will have the
LOCK prefix and the UP version won’t. Things like that.

> but I don’t see that accounting for a 5% difference. There’s
> surely some difference in the task scheduler, but I wouldn’t think that’s
> major either.

The scheduler in recent versions does do some work to be “HT aware,” but I
believe this really only matters if you have more than one physical CPU.
Scheduling on an HT system with one physical CPU isn’t any different (in
terms of results) than if you have two physical, non-HT CPUs. I don’t
know if the scheduler has optimizations for the single HT CPU case.

> I suspect that there are other differences, but I couldn’t
> find any with my Google searching, so could someone educate me/us?

I think it’s a CPU issue. The HT CPU’s firmware is doing extra work to
try to schedule work into the execution units. You also now have two
different threads fighting for one CPU’s worth of cache.

Jamie Hanrahan
Windows Internals, Drivers, and Debugging - Training and Consulting
http://www.cmkrnl.com/
http://www.azius.com/

There can be some complex interdependence inside Intel’s CPU pipelines.

For instance, Oracle once recommended switching HT off for their server to
avoid performance losses.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

----- Original Message -----
From: “Taed Wynnell”
To: “Windows System Software Devs Interest List”
Sent: Thursday, September 09, 2004 7:47 PM
Subject: [ntdev] Differences between UP and MP systems?

> —
> Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256
>
> You are currently subscribed to ntdev as: xxxxx@storagecraft.com
> To unsubscribe send a blank email to xxxxx@lists.osr.com

First of all, I’d like to point out that some of the performance difference
may not come from differences in the kernel. Performance may also degrade
from increased pressure on the caches and the memory subsystem.

If you have two threads reading from memory, they will interfere with each
other in the sense that the memory controller may have to access different
sections of memory for each thread (most likely, unless the threads are
just reading the same memory, in which case it should be in the cache).
Reading different blocks of memory like this reduces the efficiency of the
memory controller, because extra commands have to be sent to the memory
chips. This can amount to much more than 5% of the memory performance, but
I guess the average benchmark isn’t only reading memory, so it’s highly
likely that this is not the whole difference.

If you have two threads sharing the same cache, there is a greater
likelihood that something needed soon is thrown out, because each thread
reads data into the cache that takes up space needed by the other thread.
Cache thrashing can of course happen with a single thread too, but it’s
made worse by having two different threads that may not have any knowledge
of each other, and the cache isn’t any bigger when running with
hyperthreading, so this may slow things down.
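
To make the cache-sharing point concrete, here is a toy simulation. It is
only a sketch: the 4-line LRU cache, the function name and the access
patterns are all made up for illustration and look nothing like a real P4
cache. Each thread’s working set fits in the cache on its own, but the two
together do not:

```python
from collections import OrderedDict

def count_misses(accesses, capacity):
    """Count misses for an access sequence on a tiny LRU cache (toy model)."""
    cache = OrderedDict()
    misses = 0
    for addr in accesses:
        if addr in cache:
            cache.move_to_end(addr)        # mark as most recently used
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return misses

CAPACITY = 4                               # hypothetical cache size, in lines
thread_a = ["a0", "a1", "a2"] * 4          # working set of 3 lines
thread_b = ["b0", "b1", "b2"] * 4          # working set of 3 lines

# Each thread running with the whole cache to itself.
alone = count_misses(thread_a, CAPACITY) + count_misses(thread_b, CAPACITY)

# Both threads interleaved on one shared cache, as on two logical CPUs.
interleaved = [addr for pair in zip(thread_a, thread_b) for addr in pair]
shared = count_misses(interleaved, CAPACITY)

print(alone, shared)                       # prints: 6 24
```

In this contrived case every access misses once the threads share the
cache. Real workloads are nowhere near this bad, but the direction of the
effect is the same.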

A further reason for a slowdown might be that, despite the advertised
effects of hyperthreading, two threads are actually using the processor
core LESS efficiently than a single thread. This obviously depends very
much on what the actual benchmark is doing (and it applies not only to
benchmarks, of course, but a 5% slowdown is very hard to perceive on a
machine; generally, you need at least a 20% difference before you notice
it). This would make the most difference for small apps that don’t do much
memory access and spend a lot of time on CPU-bound calculations.
Memory-bound applications will suffer from the above two problems.

Also, as you mention, some of the kernel is different. Primarily, it will
use LOCK prefixes on some of the memory accesses where one CPU has to know
it’s the only CPU accessing that location, and I think someone mentioned
something about some SpinLock call being essentially a no-op on the UP
kernel, whilst it’s “a real function” on the MP kernel.
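
A rough model of that spinlock difference, for illustration only. The class
and field names are hypothetical, `threading.Lock` merely stands in for a
LOCKed test-and-set instruction, and the sketch ignores IRQL, which the
real kernel routines also have to deal with:

```python
import threading

class SketchSpinLock:
    """Toy model of a UP versus MP spinlock; not the NT implementation."""

    def __init__(self, multiprocessor):
        self.multiprocessor = multiprocessor
        self._flag = threading.Lock()   # stand-in for an atomic test-and-set
        self.atomic_ops = 0             # how many "LOCKed" operations we issued

    def acquire(self):
        if not self.multiprocessor:
            return                      # UP flavour: no other CPU, nothing to do
        while True:                     # MP flavour: spin until the flag is ours
            self.atomic_ops += 1
            if self._flag.acquire(blocking=False):
                return

    def release(self):
        if self.multiprocessor:
            self._flag.release()

up = SketchSpinLock(multiprocessor=False)
mp = SketchSpinLock(multiprocessor=True)
for lock in (up, mp):
    for _ in range(1000):               # uncontended acquire/release pairs
        lock.acquire()
        lock.release()

print(up.atomic_ops, mp.atomic_ops)     # prints: 0 1000
```

Even with no contention at all, the MP flavour pays for one atomic
operation per acquire, which is the kind of small fixed cost being
discussed here.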

Aside from the kernel image itself, I believe HAL.DLL is also different
depending on the configuration.

My guess is that the major difference in performance would be caused by the
cache/memory issues I’ve mentioned, rather than by differences in the
kernel. But that would naturally depend a lot on what benchmarks are being
run too.

To measure the true difference between the MP and UP kernels, you should be
able to switch off HyperThreading and run the same benchmark in
single-processor mode, without re-installing the kernel. If there’s a
noticeable difference, then my first three reasons are highly likely.


Mats


Besides all the previous comments, check which benchmarks were run on which
systems! I know of an early review of HT systems that indicated you would
see a 5% or more degradation, because the reviewer had seen it on Linux and
never tried anything with Windows! I worked for a company producing Linux
servers at the time (I was designing the Windows implementation); they
still find it advisable to turn off HT for Linux, but turn it on for
Windows.


Don Burn (MVP, Windows DDK)
Windows 2k/XP/2k3 Filesystem and Driver Consulting


Hi!

I believe there are many things that will improve latency but decrease
efficiency when going from UP to MP logical processors. AFAIK changing IRQL
does cost much more with MP than with UP? The effect on caches etc. has
already been mentioned.

Concerning the scheduler, well, I really do hope that there’s code in XP to
deal with some HT issues. Consider the following case…

Let’s say you have some normal-priority thread doing some lengthy
computation, and one background thread like the one xxxxx@home uses to do
its stuff. No other threads are doing real work except the “standard
stuff”, which I expect to take < 1% of the available CPU time.

So on a UP system the normal-priority thread would run alone and get nearly
100% of the total computation power. On a “true” MP system (2 physical
CPUs) the normal-priority thread would get 100% of one CPU, which would be
all it can use anyway, and the low-priority thread would get most of the
time on the second CPU.

Now, if a non-HT-aware scheduler is used to run an HT machine, the case
would be the same, only that the total computation power isn’t near the
200% you get from 2 physical CPUs, but rather ~140% (at least as advertised
by Intel…). The normal-priority thread would end up getting only ~70% of
the computation power as compared to the non-HT case. Of course the “sum of
all the work that the CPU gets done” would be more than (or the same as) in
the UP case, but nearly 50% of it would be “non-important” work… - the
normal-priority thread would take longer to get its job done than with HT
disabled.

An HT-aware scheduler could disable HT (by simply HALTing one logical CPU)
if there are 2 threads with different priorities, and only let the 2
logical CPUs run in the case of interrupts or 2 concurrent threads with the
same priority. In my opinion that would be a far better thing than to just
let my screensaver (or xxxxx@home or whatever) take away much of the nice
computation power from the compiler, renderer or whatever “hungry”
single-threaded application I might have running.
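
The arithmetic behind those figures, written out as a small sketch. The
~140% number is the advertised figure quoted above; the even split between
the two logical CPUs and the function name are assumptions made for the
example:

```python
def per_thread_share(total_throughput, busy_logical_cpus):
    """Throughput each thread gets if the logical CPUs split the core evenly.

    The even split is an assumption; actual sharing depends on the workload.
    """
    return total_throughput / busy_logical_cpus

HT_TOTAL = 1.40   # advertised HT throughput, where one non-HT CPU = 1.00

share = per_thread_share(HT_TOTAL, 2)   # what the normal-priority thread gets
slowdown = 1.0 - share                  # lost relative to running alone
runtime_factor = 1.0 / share            # how much longer its job takes

print(f"share={share:.2f} slowdown={slowdown:.0%} runtime x{runtime_factor:.2f}")
# prints: share=0.70 slowdown=30% runtime x1.43
```

So under these assumptions a job that takes 10 minutes with HT disabled
would take a bit over 14 minutes with HT enabled and a background thread
keeping the second logical CPU busy.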

BTW: does anyone know if the scheduler in XP does that? And can anyone tell
me if the scheduler in win2k sp4 is “HT aware”? I’m running my 2k-box with
HT disabled for now…

Regards,
Paul Groke


> An HT-aware scheduler could disable HT (by simply HALTing one
> logical CPU) if there are 2 threads with different
> priorities, and only let the 2 logical CPUs run in the case
> of interrupts or 2 concurrent threads with the same priority.
> […]
> BTW: does anyone know if the scheduler in XP does that?

No, it does not.

> And can anyone tell me if the scheduler in win2k sp4 is “HT aware”?

Not at all.

— Jamie Hanrahan
Kernel Mode Systems http://www.cmkrnl.com/
Azius Developer Training http://www.azius.com/
Windows internals and drivers - consulting, training, and debugging

Paul,

I’m not sure exactly what you want the scheduler to do… The processor, in
HT mode, will ALWAYS run two threads at the same time. Of course, one of
those threads may be the IDLE thread, which does a “HALT”. But no matter
what you do, the processor will attempt to execute two threads, and there
is really no control for the scheduler that says “Give more priority to
Thread1” or some such. Both threads are given equal priority. So, there are
a few possible options for the scheduler (and some variations):

  1. Schedule only one thread at a time. Hyperthreading is essentially
    meaningless.
  2. Always schedule one thread per logical processor, using hyperthreading
    to the maximum.
  3. Try to figure out what the behaviour of the thread is, and schedule
    accordingly (not sure what the rules for this would be).
  4. If a thread has higher priority than other runnable threads, then run
    only one; if they are equal priority, run two threads.
  5. If a thread has high priority, then schedule it on its own; if it’s got
    low priority, schedule it together with other low-priority threads.
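
Options #2 and #4 from the list above can be sketched as a selection
function. This is only an illustration with hypothetical names; it is not
how the NT scheduler is written, and it ignores affinity, quantum and
everything else a real scheduler tracks:

```python
def pick_threads(ready, policy):
    """Choose what runs on the two logical CPUs of one HT processor.

    ready  : list of (name, priority) tuples; a higher number means a
             higher priority.
    policy : "option2" (always fill both logical CPUs) or
             "option4" (leave the sibling halted unless priorities match).
    Returns a (first, second) tuple of thread names; None means that
    logical CPU sits in HALT.
    """
    ordered = sorted(ready, key=lambda t: t[1], reverse=True)
    first = ordered[0] if ordered else None
    second = ordered[1] if len(ordered) > 1 else None

    if policy == "option4" and first and second and second[1] < first[1]:
        second = None   # don't let a lower-priority thread slow the first one

    return (first[0] if first else None, second[0] if second else None)

ready = [("compiler", 8), ("background", 1)]
print(pick_threads(ready, "option2"))   # prints: ('compiler', 'background')
print(pick_threads(ready, "option4"))   # prints: ('compiler', None)
```

With two equal-priority threads both policies give the same answer, which
is why the choice only matters for the mixed-priority case being argued
about here.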

Now, I believe #2 is the current method of scheduling. The only added
feature to take care of in an HT system is that two threads that are
sharing any form of data should be scheduled on the same physical
processor, in the case of multiple physical processors.

It’s very hard to figure out what is the right choice in the #3+ options
above. This is because a thread could well have high priority, but also be
using the memory inefficiently (not all code that has high priority is
written to make good use of the processor). #3 would be an ideal solution,
but then you’d have to use LOTS of different metrics (memory usage
efficiency, which execution units are used, etc, etc).

Note also that if you’re running xxxxx@home, the floating-point unit will
be busier than the integer unit, so a thread running mostly in cache and
using the integer unit would run quite well. Of course, SETI also uses
quite a big chunk of memory, so the caches and memory controller will be
more stressed than if the other thread was the “IDLE” thread…

The whole point of hyperthreading is that there are two threads to run in
the processor, so the processor can do something useful when it’s blocked
waiting for a memory operation to finish (or something else that takes
time), and if you only schedule one thread at a time, that wouldn’t work…
If you don’t want this, turn off HT in the BIOS…


Mats


> From: “Mats PETERSSON”
> The processor, in HT mode, will ALWAYS run two threads at the same time.
[…]
> It’s very hard to figure out what is the right choice in the #3+ options
> above.

Yep, exactly.

Incidentally, understand that future processors that support HT may
implement some sort of priority for the two hardware tasks. This would
allow the CPU microcode to have some awareness of the OS’s idea of
priority. But when you think about how traps are mostly handled in the NT
family (no task state switch) things get even worse. You wouldn’t want
the uCode to decide to pay little attention to one task, because that task
had low priority, and then have that be the task that ends up servicing an
interrupt or a DPC!

Whatever else you can say about HT, I will say this: It’s a great prod to
all the lazy hardware vendors, particularly in the “consumer” space, who
have been shipping non-MP-safe drivers. “We don’t need to test in the MP
environment because MP is for servers and our stuff is for end users,”
they say. Well, guess what, end users, gamers, etc., are now running on
HT systems, which demand drivers every bit as MP-safe as do “real” MP
systems.
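
As an illustration of what “MP-safe” means here (a sketch with a made-up
counter and a made-up replay function, not taken from any real driver): an
unprotected read-modify-write gives the right answer as long as the two
code paths never overlap, and silently loses an update once two logical
CPUs can interleave them.

```python
def run(schedule):
    """Replay an explicit interleaving of load/store steps from two CPUs.

    schedule is a list of (cpu, step) pairs where step is "load" or "store".
    Each CPU loads the shared counter into a private register, then stores
    register + 1 back, with no lock around the pair.
    """
    shared = {"count": 0}
    registers = {}
    for cpu, step in schedule:
        if step == "load":
            registers[cpu] = shared["count"]
        else:
            shared["count"] = registers[cpu] + 1
    return shared["count"]

# The two increments run one after the other: nothing can go wrong.
serial = [(0, "load"), (0, "store"), (1, "load"), (1, "store")]

# The two increments overlap on two logical CPUs: one update is lost.
overlapped = [(0, "load"), (1, "load"), (0, "store"), (1, "store")]

print(run(serial), run(overlapped))   # prints: 2 1
```

Code that only ever saw the first ordering during testing can look
perfectly healthy until it runs on a machine where the second ordering is
possible.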

Hah! I say again… HAH! :-) These folks are now forced to go back and
fix their code.

… I wish. A few have resorted to saying “you will have to disable HT to
use our stuff.” Of course, any vendor who tells you that should be
publicly humiliated, right along with those who insist that their PCI
devices can’t tolerate shared “IRQs.”

Jamie Hanrahan
Windows Internals and Drivers Training and Consulting
http://www.cmkrnl.com/
http://www.azius.com/

Hi Mats,

I think out of your list #4 would be rather easy to implement and a good
option. Maybe not for all cases, but at least for some cases. I mean a
thread on logical CPU 1 will run faster if logical CPU 2 is “executing”
HALT. On a system with low-priority threads that do some cleanup and
housekeeping work but are always runnable, the latency for critical
operations could be improved. For most applications threads will have the
same priority, so HT would work well, and if there’s critical work to do,
it would not be slowed down just to execute some cleanup thread that might
as well run at a later time.

#5 would be similar to #4 except that it would not let two threads with the
same high priority run in parallel - I see no real benefit in that. Latency
for one of those threads would be better, but latency for the other thread
would be worse.

#3 finally would be very complicated and would have to be implemented for
one specific and well-known CPU design. If Intel (in the case of the P4)
changed the caches or anything else in the CPU, the solution might end up
being slower and less responsive than a non-HT-aware system.

Ah, yes, xxxxx@home was just an example - I actually meant any of those
“use-up-all-idle-CPU-time-for-non-important-stuff” programs.

> The whole point of hyperthreading is that there are two threads to run in
> the processor, so the processor can do something useful when it’s blocked
> waiting for a memory operation to finish (or something else that takes
> time), and if you only schedule one thread at a time, that wouldn’t work…
> If you don’t want this, turn off HT in the BIOS…

Yeah, I understand the point of HT, and I like parallel execution (even
without a better efficiency) for ISRs and stuff. I just don’t like the
idea that critical threads are slowed down by non-critical threads. I find
it somewhat strange that a thread running at very high priority runs
“full-speed” when there are no other threads running, and runs like 30%
slower if there is a very very low priority thread running too - it
shouldn’t do that, that’s what the whole priority-system is all about.

But it’s ok, I can accept the way it is implemented in XP ;-) - I just
wanted to point out that there are some real differences between an HT
system and a 2-physical-CPU system that could be accounted for in the
scheduler.

Have a nice day,

Paul

Mats PETERSSON
Gesendet von: xxxxx@lists.osr.com
10.09.2004 10:49
Bitte antworten an “Windows System Software Devs Interest List”

An: “Windows System Software Devs Interest List”

Kopie:
Thema: Re: [ntdev] Differences between UP and MP systems?

Paul,

I’m not sure exactly what you want the scheduler to do… The processor, in
HT mode, will ALWAYS run two threads at the same time. Of course, one of
those threads may be the IDLE thread, which does a “HALT”. But no matter
what you do, the processor will attempt to execute two threads, and there
is really no control for the scheduler that says “Give more priority to
Thread1” or some such. Both threads are given equal priority. So, there are
a few possible options for the scheduler (and some variations):
1. Schedule only one thread at a time. Hyperthreading is essentially
meaningless.
2. Always schedule one thread per logical processor, using hyperthreading
to the maximum.
3. Try to figure out what the behaviour of the thread is, and schedule
accordingly (not sure what the rules for this would be).
4. If a thread has higher priority than other runnable threads, then run
only that one; if they are equal priority, run two threads.
5. If a thread has high priority, then schedule it on its own; if it has
low priority, schedule it together with other low-priority threads.
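
The options in the list above can be sketched as a small decision function. This is only an illustration of the idea, not how any Windows scheduler is implemented: the enum, the function name and the `high` threshold are all hypothetical, and option 3 is left out because no rules were given for it.

```c
/* Scheduling options from the list above; names are illustrative. */
enum ht_policy {
    HT_ONE_AT_A_TIME  = 1,  /* option 1 */
    HT_ALWAYS_BOTH    = 2,  /* option 2 */
    HT_PRIORITY_GATED = 4,  /* option 4 */
    HT_HIGH_ALONE     = 5   /* option 5 */
};

/* Decide how many of the two highest-priority runnable threads to place
 * on the two logical CPUs of one HT core. Larger numbers mean higher
 * priority, and prio_first >= prio_second is assumed. `high` is the
 * (made-up) threshold at which option 5 treats a thread as "high". */
int ht_threads_to_run(enum ht_policy policy, int prio_first,
                      int prio_second, int high)
{
    switch (policy) {
    case HT_ONE_AT_A_TIME:
        return 1;
    case HT_PRIORITY_GATED:
        return prio_first > prio_second ? 1 : 2;
    case HT_HIGH_ALONE:
        return prio_first >= high ? 1 : 2;
    case HT_ALWAYS_BOTH:
    default:
        return 2;
    }
}
```

The only case where options 4 and 5 disagree in this sketch is two equally high-priority threads: option 4 runs both, option 5 runs just one.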

Now, I believe #2 is the current method of scheduling. The only added
feature to take care of in an HT system is that two threads that are
sharing any form of data should be scheduled on the same physical
processor, in the case of multiple physical processors.
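
As a rough illustration of the "same physical processor" point: given a table saying which physical package each logical CPU belongs to, the set of sibling logical CPUs can be expressed as a bitmask. The table and the function name are assumptions made for this sketch; how the mapping is discovered is platform-specific and not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a bitmask of all logical CPUs that live in the same physical
 * package as logical CPU `cpu`. package_of[i] is the package number of
 * logical CPU i (supplied by the caller; at most 32 CPUs handled). */
uint32_t sibling_mask(const unsigned package_of[], size_t ncpu, size_t cpu)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < ncpu && i < 32; i++) {
        if (package_of[i] == package_of[cpu])
            mask |= (uint32_t)1 << i;
    }
    return mask;
}
```

On Windows, a mask of this kind is the sort of value one would hand to `SetThreadAffinityMask` for both data-sharing threads, though whether that actually helps depends on the workload.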

It’s very hard to figure out what is the right choice in the #3+ options
above. This is because a thread could well have high priority, but also be
using the memory inefficiently (not all code that has high priority is
written to make good use of the processor). #3 would be an ideal solution,
but then you’d have to use LOTS of different metrics (memory usage
efficiency, which execution units are used, etc, etc).

Note also that if you’re running xxxxx@home, the floating-point unit will
be busier than the integer unit, so a thread running mostly in cache, using
the integer unit, would run quite well. Of course, SETI also uses quite a
big chunk of memory, so caches and memory controller will be more stressed
than if the other thread was the “IDLE” thread…

The whole point of hyperthreading is that there are two threads to run in
the processor, so the processor can do something useful when it’s blocked
waiting for a memory operation to finish (or something else that takes
time), and if you only schedule one thread at a time, that wouldn’t
work…
If you don’t want this, turn off HT in the BIOS…


Mats

xxxxx@lists.osr.com wrote on 09/09/2004 06:47:24 PM:

> Hi!
>
> I believe there are many things that will improve latency but decrease
> efficiency when going from UP to MP logical processors. AFAIK changing
> IRQL costs much more with MP than with UP? The effect on caches etc. has
> already been mentioned.
>
> Concerning the scheduler, well, I really do hope that there’s code in XP
> to deal with some HT issues. Consider the following case…
> Let’s say you have some normal-priority thread doing some lengthy
> computation, and one background thread like xxxxx@home uses to do its
> stuff. No other threads are doing real work except the “standard stuff”,
> which I expect to use < 1% of the available CPU time.
> So on a UP system the normal-priority thread would run alone and get
> nearly 100% of the total computation power. On a “true” MP system (2
> physical CPUs) the normal-priority thread would get 100% of one CPU, which
> would be all it can use anyway, and the low-priority thread would get most
> of the time on the second CPU.
> Now, if a non-HT-aware scheduler is used to run an HT machine, the case
> would be the same, only that the total computation power isn’t near the
> 200% you get from 2 physical CPUs, but rather ~140% (at least as
> advertised by Intel…).
> The normal-priority thread would end up getting only ~70% of the
> computation power as compared to the non-HT case. Of course the “sum of
> all the work that the CPU gets done” would be more than (or the same as)
> in the UP case, but nearly 50% of it would be “non-important” work… - the
> normal-priority thread would take longer to get its job done than with HT
> disabled.
> An HT-aware scheduler could disable HT (by simply HALTing one logical CPU)
> if there are 2 threads with different priorities, and only let the 2
> logical CPUs run in the case of interrupts or 2 concurrent threads with
> the same priority.
> In my opinion that would be a far better thing than to just let my
> screensaver (or xxxxx@home or whatever) take away much of the nice
> computation power from the compiler, renderer or whatever “hungry”
> single-threaded application I might have running.
>
> BTW: does anyone know if the scheduler in XP does that? And can anyone
> tell me if the scheduler in win2k sp4 is “HT aware”? I’m running my 2k box
> with HT disabled for now…
>
> Regards,
> Paul Groke
>
> Mats PETERSSON
> Sent by: xxxxx@lists.osr.com
> 09.09.2004 18:10
> Please reply to “Windows System Software Devs Interest List”
>
> To: “Windows System Software Devs Interest List”
>
> Cc:
> Subject: Re: [ntdev] Differences between UP and MP systems?
>
> First of all, I’d like to point out that some of the performance
> difference may not come from differences in the kernel. It’s also true to
> say that the performance may degrade from an increased pressure on caches
> and memory subsystem.
>
> If you have two threads reading from memory, they will interfere with each
> other in the sense that the memory controller may have to access different
> sections of memory for each thread (most likely, unless the threads are
> just reading the same memory, in which case it should be in the cache).
> Reading different blocks of memory like this will reduce the efficiency of
> the memory controller by adding extra commands to be sent to the memory
> chips. This can amount to much more than 5% of the memory performance, but
> I guess the average benchmark isn’t only reading memory, so it’s highly
> likely that this is not all that is different.
>
> If you have two threads sharing the same cache, there will be a greater
> likelihood that something needed soon is thrown out, because each thread
> will read data into the cache that takes the space needed by the other
> thread. Thrashing the cache can of course happen in single threading too,
> but it’s increased by the fact that you have two different threads that
> may not have any knowledge of each other, and the cache isn’t any bigger
> when running hyperthreading, so this may slow things down.
>
> A further reason for a slowdown might be that, despite the advertised
> effects of hyperthreading, two threads are actually using the processor
> core LESS efficiently than a single thread. This obviously depends very
> much on what the actual benchmark is doing (not only benchmarks, of
> course, but a 5% slowdown is very hard to perceive on a machine; generally
> you need to get at least a 20% difference before we realize there’s a
> difference). This would make most of the difference on small apps that
> don’t do much memory accessing and spend a lot of time on CPU-bound
> calculations. Memory-bound applications will suffer from the above two
> problems.
>
> Also, as you mention, some of the kernel is different; primarily, it will
> use LOCK prefixes on some of the memory accesses where one CPU has to know
> it’s the only CPU to access this location, and I think someone mentioned
> something about some SpinLock call being essentially a no-op on the UP
> kernel, whilst it’s “a real function” on the MP kernel.
>
> Aside from the NTKERNXX, I believe HAL.DLL is also different depending on
> the configuration.
>
> My guess is that the major difference in performance would be caused by
> the cache/memory issues I’ve mentioned, rather than by differences in the
> kernel. But that would naturally depend a lot on what benchmarks are being
> run too.
>
> To measure the true difference between MP and UP kernel, you should be
> able to switch off the HyperThreading and run the same benchmark in single
> processor mode, without re-installing the kernel. If there’s a noticeable
> difference, then my first three reasons are highly likely.
>
> --
> Mats

Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256

You are currently subscribed to ntdev as: xxxxx@tab.at
To unsubscribe send a blank email to xxxxx@lists.osr.com

Please visit us: www.tab.at www.championsnet.net
www.silverball.com

A cpu/memory bound thread on physical processor 1 will run faster if
physical CPU 2, on which other threads could interfere with memory access,
is held idle. This sort of defeats the purpose of having multiple
processors. The real question for an MP system is how much useful work is
the system doing in total, not how much work is one thread doing.

Is there a general purpose MP OS out there that even considers not
scheduling threads on inactive processors?

=====================
Mark Roddy


> I think out of your list #4 would be rather easy to implement and a good
> option. Maybe not for all cases, but at least for some cases. I mean a
> thread in logical CPU 1 will run faster if logical CPU 2 is “executing”
> HALT. On a system with low-priority threads that do some cleanup and

Intel suggests inserting the PAUSE opcode inside any busy loop; this will
relinquish the CPU core to the second sub-CPU. NT’s spinlocks already use this.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

xxxxx@lists.osr.com wrote on 09/10/2004 01:46:36 PM:

> > I think out of your list #4 would be rather easy to implement and a good
> > option. Maybe not for all cases, but at least for some cases. I mean a
> > thread in logical CPU 1 will run faster if logical CPU 2 is “executing”
> > HALT. On a system with low-priority threads that do some cleanup and
>
> Intel suggests inserting the PAUSE opcode inside any busy loop; this will
> relinquish the CPU core to the second sub-CPU. NT’s spinlocks already use
> this.

Yes, that’s a clever thing if you have a piece of code that does something
like:

    while (!ready)  // ready will be set by hardware outside the CPU
    {
        timeout--;
        if (!timeout)
        {
            error = ERR_TIMEOUT;
            break;
        }
        // Insert PAUSE opcode here.
    }

This will help the second logical CPU get some more cycles done whilst
we’re waiting for the hardware (for instance) to get done. And code that is
intended to run at a low priority could of course also do this, but it
would be fairly inefficient to do it if there’s no other (high-priority)
thread running, so in the end, it’s still not really that helpful unless
we’re actually wanting to PAUSE the CPU for a (fairly) large number of
cycles waiting for something external.
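
For reference, a compilable version of the loop above might look like the following. It is a minimal sketch, not NT's spinlock code: `wait_until_ready`, `cpu_relax` and the return codes are names invented here, and the PAUSE instruction is emitted through the `_mm_pause()` compiler intrinsic on x86-64 builds only (other targets fall back to a no-op).

```c
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)     /* no-op fallback on other targets */
#endif

enum { WAIT_OK = 0, ERR_TIMEOUT = 1 };

/* Spin until *ready becomes non-zero or `spins` polls have been used up.
 * `ready` is volatile because something outside this code (hardware,
 * another CPU) is expected to change it. */
int wait_until_ready(const volatile int *ready, unsigned long spins)
{
    while (!*ready) {
        if (spins == 0)
            return ERR_TIMEOUT;
        spins--;
        cpu_relax();  /* hint to the core that this is a spin-wait */
    }
    return WAIT_OK;
}
```

Counting polls is only a crude timeout; how long each PAUSE takes differs between CPU generations, so real code would normally bound the wait by time instead.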

So this doesn’t really solve Paul’s concerns regarding scheduling as such
(because the scheduler can’t insert a Pause in the code when it needs
to)…


Mats

Well, a lot of big guns have already spelled out quite a bit :-).

But the fundamental paradigm is that *whenever we do pseudo or real parallel
processing* there is overhead in getting the parts to synchronize, be it
multithreaded (UP) or multiprocessor (HT or otherwise). Usually a gross
estimate is log base 2 of N. So the real question is whether the benchmark
is pragmatic. I can always come up with logic that flushes the cache and
pipeline, thrashes pages, etc., to make every single advance look horrible
and prove it is no good. So essentially *WHAT IS THE BENCHMARK FOR*, HOW DID
THEY PICK THE SAMPLES, etc. are important too, I suppose. In a shared-memory
MP architecture there will always be concerns about bus locking, CPU
(logical/physical) yielding, etc. Then there are memories organized as 4-way
or 8-way grids, so 8 simultaneous memory accesses are possible, though not
for locking, given a clever design of the parallelism, and yet we can blow
that too if we want. There is also instruction reordering; looking at the
above, we can easily blow that up too, and that would be a logic bug rather
than a performance bug :-).

So in essence, HT machines are here to stay, whether we like it or not :-).
The latest craze is MS Media Center; people like to see it as true
entertainment hardware, and HT will surely play a big role in lots of houses.

When I look at any benchmark, I ask myself: is it fad or fact? :-)

-pro


Ok, good point. Maybe this is less an HT issue than a general MP issue (at
least with shared memory) - didn’t think this through before. I admit I
missed an important point there - sorry about that.

> The real question for an MP system is how much useful work is
> the system doing in total, not how much work is one thread doing.

I agree, I just wanted to emphasise the “useful” in that statement. Some
things will run that aren’t always useful - for example garbage collectors
might run more often than necessary. Of course one could design such “idle
time eaters” so that they check at least if their own process is doing any
“real” work before eating “free” CPU time. But I also think it’s something
the OS could do (like with a priority level that means “only execute if
the whole system is idle” instead of “only execute if one CPU is idle”) so
any developer who needs a similar mechanism in one of their programs
wouldn’t have to reinvent the wheel, and also it would work system-wide,
not just process-wide.
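
The difference between the two rules can be written down in a few lines. This is purely a sketch of the proposal, not an existing OS feature: the `busy[]` array (one flag per CPU saying whether it is running non-idle-class work) and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* "Only execute if one CPU is idle": an idle-class thread may run on
 * `cpu` whenever that particular CPU has nothing else to do. */
bool may_run_if_cpu_idle(const bool busy[], size_t cpu)
{
    return !busy[cpu];
}

/* "Only execute if the whole system is idle": run only when no CPU
 * at all is doing real work. */
bool may_run_if_system_idle(const bool busy[], size_t ncpu)
{
    for (size_t i = 0; i < ncpu; i++) {
        if (busy[i])
            return false;
    }
    return true;
}
```

With one busy and one idle logical CPU, the first rule lets the idle-time eater run next to the real work, while the second keeps it off the machine until both are idle.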

And of course one wouldn’t have to stop there - other resources (like
disk access) could benefit from a priority mechanism too ;-) A flag like
“execute only when the disk is idle”… But then again that’s way off
topic.

But guys, I also don’t want to annoy you - if you think it’s not worth
talking about we can just cut it here - it’s not all that important to me,
more like a thing I find interesting.

Regards,

Paul Groke

“Roddy, Mark”
Gesendet von: xxxxx@lists.osr.com
10.09.2004 14:22
Bitte antworten an “Windows System Software Devs Interest List”

An: “Windows System Software Devs Interest List”

Kopie:
Thema: RE: [ntdev] Differences between UP and MP systems?

A cpu/memory bound thread on physical processor 1 will run faster if
physical CPU 2, on which other threads could interfere with memory access,
is held idle. This sort of defeats the purpose of having multiple
processors. The real question for an MP system is how much useful work is
the system doing in total, not how much work is one thread doing.

Is there a general purpose MP OS out there that even considers not
scheduling threads on inactive processors?

=====================
Mark Roddy

-----Original Message-----
From: xxxxx@tab.at [mailto:xxxxx@tab.at]
Sent: Friday, September 10, 2004 6:48 AM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] Differences between UP and MP systems?

Hi Mats,

I think out of your list #4 would be rather easy to implement and a good
option. Maybe not for all cases, but at least for some cases. I mean a
thread on logical CPU 1 will run faster if logical CPU 2 is “executing”
HALT. On a system with low-priority threads that do some cleanup and
housekeeping work but are always runnable, the latency for critical
operations could be improved. For most applications threads will have the
same priority, so HT would work well, and if there’s critical work to do it
would not be slowed down just to execute some cleanup thread that might as
well run at a later time.

#5 would be similar to #4 except that it would not let two threads with the
same high priority run in parallel - I see no real benefit in that. Latency
for one of those threads would be better, but latency for the other thread
would be worse.

#3 finally would be very complicated and would have to be implemented
towards one specific and well-known CPU design. If Intel (in the case of
the P4) changed the caches or anything else in the CPU, the solution might
end up being slower and less responsive than a non-HT-aware system.

Ah, yes, xxxxx@home was just an example - I actually meant any of those
“use-up-all-idle-CPU-time-for-non-important-stuff” programs.

> The whole point of hyperthreading is that there are two threads to run in
> the processor, so the processor can do something useful when it’s blocked
> waiting for a memory operation to finish (or something else that takes
> time), and if you only schedule one thread at a time, that wouldn’t work…
> If you don’t want this, turn off HT in the BIOS…

Yeah, I understand the point of HT, and I like parallel execution (even
without better efficiency) for ISRs and stuff. I just don’t like the idea
that critical threads are slowed down by non-critical threads. I find it
somewhat strange that a thread running at very high priority runs
“full-speed” when there are no other threads running, and runs like 30%
slower if there is a very very low priority thread running too - it
shouldn’t do that, that’s what the whole priority-system is all about.
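
The “30% slower” figure follows from the ~140% combined-throughput number
quoted elsewhere in this thread. A minimal sketch of that arithmetic in
Python, assuming (purely for illustration, this is not a measurement) that
the core splits its combined throughput evenly between the two logical
CPUs; the function name per_thread_share is made up for this example:

```python
def per_thread_share(combined_throughput=1.4, logical_cpus=2):
    """Throughput of one thread relative to running alone on the core.

    combined_throughput: total work done with both logical CPUs busy,
    relative to one thread alone (1.4 is the advertised figure quoted in
    the thread, taken here as an assumption).
    """
    return combined_throughput / logical_cpus


share = per_thread_share()
slowdown = 1.0 - share
print(f"per-thread throughput: {share:.0%}, slowdown: {slowdown:.0%}")
```

With a lower real-world HT gain the per-thread share shrinks further, which
is why the effect depends so much on the workload.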

But it’s ok, I can accept the way it is implemented in XP :wink: - I just
wanted to point out that there are some real differences between an HT
system and a 2-physical-CPU system that could be accounted for in the
scheduler.

Have a nice day,

Paul

Mats PETERSSON
Sent by: xxxxx@lists.osr.com
10.09.2004 10:49
Please reply to “Windows System Software Devs Interest List”

To: “Windows System Software Devs Interest List”

Cc:
Subject: Re: [ntdev] Differences between UP and MP systems?

Paul,

I’m not sure exactly what you want the scheduler to do… The processor, in
HT mode, will ALWAYS run two threads at the same time. Of course, one of
those threads may be the IDLE thread, which does a “HALT”. But no matter
what you do, the processor will attempt to execute two threads, and there
is really no control for the scheduler that says “Give more priority to
Thread1” or some such. Both threads are given equal priority. So, there
are a few possible options for the scheduler (and some variations):
1. Schedule only one thread at a time. Hyperthreading is essentially
meaningless.
2. Always schedule one thread per logical processor, using hyperthreading
to the maximum.
3. Try to figure out what the behaviour of the thread is, and schedule
accordingly (not sure what the rules for this would be).
4. If a thread has higher priority than other runnable threads, then run
only one; if they are equal priority, run two threads.
5. If a thread has high priority, then schedule it on its own; if it’s got
low priority, schedule it together with other low priority threads.
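
As a rough illustration of how options #2 and #4 above differ, here is a
toy model in Python. It is only a sketch under stated assumptions: the
function pick_threads and the priority numbers are hypothetical, one HT
core with two logical CPUs is assumed, and none of this reflects the actual
Windows scheduler code.

```python
def pick_threads(priorities, option):
    """Choose which runnable threads occupy the two logical CPUs of one core.

    priorities: list of ints, one per runnable thread (higher = more urgent).
    option: 2 -> always fill both logical CPUs;
            4 -> co-schedule a second thread only if it ties the top priority.
    Returns the priorities of the chosen threads (at most two).
    """
    ranked = sorted(priorities, reverse=True)
    if not ranked:
        return []  # nothing runnable: both logical CPUs would sit in HALT
    if option == 2:
        return ranked[:2]
    if option == 4:
        top = ranked[0]
        if len(ranked) > 1 and ranked[1] == top:
            return ranked[:2]
        return [top]  # the other logical CPU is left idle (HALT)
    raise ValueError("only options 2 and 4 are modelled here")


# One high-priority thread plus one background thread:
print(pick_threads([8, 1], option=2))  # both run, sharing the core
print(pick_threads([8, 1], option=4))  # only the high-priority one runs
```

Option #4 trades total throughput for latency of the top-priority thread,
which is the trade-off being argued about in this thread.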

Now, I believe #2 is the current method of scheduling. The only added
feature to take care of in an HT system is that two threads that are
sharing any form of data should be scheduled on the same physical
processor, in the case of multiple physical processors.

It’s very hard to figure out what is the right choice in the #3+ options
above. This is because a thread could well have high priority, but also be
using the memory inefficiently (not all code that has high priority is
written to make good use of the processor). #3 would be an ideal solution,
but then you’d have to use LOTS of different metrics (memory usage
efficiency, which execution units are used, etc, etc).

Note also that if you’re running xxxxx@home, the float unit will be busier
than the integer unit, so a thread running mostly in cache and using the
integer unit would run quite well. Of course, SETI also uses quite a big
chunk of memory, so caches and memory controller will be more stressed than
if the other thread was the “IDLE” thread…

The whole point of hyperthreading is that there are two threads to run in
the processor, so the processor can do something useful when it’s blocked
waiting for a memory operation to finish (or something else that takes
time), and if you only schedule one thread at a time, that wouldn’t work…
If you don’t want this, turn off HT in the BIOS…


Mats

xxxxx@lists.osr.com wrote on 09/09/2004 06:47:24 PM:

> Hi!
>
> I believe there are many things that will improve latency but decrease
> efficiency when going from UP to MP logical processors. AFAIK changing
> IRQL does cost much more with MP than with UP? The effect on caches etc.
> has already been mentioned.
>
> Concerning the scheduler, well, I really do hope that there’s code in
> XP to deal with some HT issues. Consider the following case…
> Let’s say you have some normal-priority thread doing some lengthy
> computation, and one background thread like the one xxxxx@home uses to do
> its stuff. No other threads are doing real work except the
> “standard-stuff”, which I expect to use < 1% of the available CPU time.
> So on a UP system the normal-priority thread would run alone and get
> nearly 100% of the total computation-power. On a “true” MP system (2
> physical CPUs) the normal-priority thread would get 100% of one CPU,
> which would be all it can use anyway, and the low-priority thread would
> get most of the time on the second CPU.
> Now, if a non-HT-aware scheduler is used to run an HT machine, the case
> would be the same, only that the total computation-power isn’t near
> the 200% you get from 2 physical CPUs, but rather ~140% (at least as
> advertised by Intel…).
> The normal-priority thread would end up getting only ~70% of the
> computation-power as compared to the non-HT case. Of course the “sum
> of all the work that the CPU gets done” would be more than (or the same
> as) in the UP case, but nearly 50% of it would be “non-important” work…
> - the normal-priority thread would take longer to get its job done
> than with HT disabled.
> An HT-aware scheduler could disable HT (by simply HALTing one logical
> CPU) if there are 2 threads with different priorities, and only let the 2
> logical CPUs run in the case of interrupts or 2 concurrent threads with
> the same priority.
> In my opinion that would be a far better thing than to just let my
> screensaver (or xxxxx@home or whatever) take away much of the nice
> computation-power from the compiler, renderer or whatever “hungry”
> singlethreaded application I might have running.
>
> BTW: does anyone know if the scheduler in XP does that? And can anyone
> tell me if the scheduler in win2k sp4 is “HT aware”? I’m running my
> 2k-box with HT disabled for now…
>
> Regards,
> Paul Groke
>
> Mats PETERSSON
> Sent by: xxxxx@lists.osr.com
> 09.09.2004 18:10
> Please reply to “Windows System Software Devs Interest List”
>
> To: “Windows System Software Devs Interest List”
>
> Cc:
> Subject: Re: [ntdev] Differences between UP and MP systems?
>
> First of all, I’d like to point out that some of the performance
> difference may not come from differences in the kernel. It’s also true to
> say that the performance may degrade from an increased pressure on caches
> and memory subsystem.
>
> If you have two threads reading from memory, they will interfere with each
> other in the sense that the memory controller may have to access different
> sections of memory for each thread (most likely, unless the threads are
> just reading the same memory, in which case it should be in the cache).
> Reading different blocks of memory like this will reduce the efficiency of
> the memory controller by adding extra commands to be sent to the memory
> chips. This can amount to much more than 5% of the memory performance, but
> I guess the average benchmark isn’t only reading memory, so it’s highly
> likely that this is not all that is different.
>
> If you have two threads sharing the same cache, there will be a greater
> likelihood that something needed soon is thrown out, because each thread
> will read data into the cache that takes up space needed by the other
> thread. Thrashing the cache can of course happen in single threading too,
> but it’s increased by the fact that you have two different threads that
> may not have any knowledge of each other, and the cache isn’t any bigger
> when running hyperthreading, so this may slow things down.
>
> A further reason for a slowdown might be that, despite the advertised
> effects of hyperthreading, two threads are actually using the processor
> core LESS efficiently than a single thread. This obviously depends very
> much on what the actual benchmark is doing (not only benchmarks, of
> course, but a 5% slowdown is very hard to perceive on a machine;
> generally, you need to get at least a 20% difference before you notice
> it). This would make most of the difference on small apps that don’t do
> much memory accessing, and spend a lot of time on CPU-bound calculations.
> Memory-bound applications will suffer from the above two problems.
>
> Also, as you mention, some of the kernel is different; primarily, it will
> do LOCK prefixes on some of the memory accesses where one CPU has to know
> it’s the only CPU to access this location, and I think someone mentioned
> something about some SpinLock call being essentially a no-op on the UP
> kernel, whilst it’s “a real function” on the MP kernel.
>
> Aside from the NTKERNXX, I believe HAL.DLL is also different depending on
> the configuration.
>
> My guess is that the major difference in performance would be caused by
> the cache/memory issues I’ve mentioned, rather than by differences in the
> kernel. But that would naturally depend a lot on what benchmarks are being
> run too.
>
> To measure the true difference between MP and UP kernel, you should be
> able to switch off the HyperThreading and run the same benchmark in single
> processor mode, without re-installing the kernel. If there’s a noticeable
> difference, then my first three reasons are highly likely.
>
> –
> Mats
>
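
The cache-sharing effect described in the quoted message can be shown with
a toy simulation in Python. This is a deliberately extreme sketch, not a
model of the P4: it assumes a tiny fully associative cache with strict LRU
replacement, and the names hit_rate, CACHE_LINES, thread_a and thread_b are
invented for the example. Each thread’s working set fits in the cache on
its own, but the two together do not.

```python
from collections import OrderedDict


def hit_rate(accesses, cache_lines):
    """Fraction of accesses that hit in an LRU cache of `cache_lines` entries."""
    cache = OrderedDict()
    hits = 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(accesses)


CACHE_LINES = 8
thread_a = [i % 6 for i in range(600)]          # loops over 6 cache lines
thread_b = [100 + (i % 6) for i in range(600)]  # loops over 6 other lines

alone = hit_rate(thread_a, CACHE_LINES)

# Both threads on one HT core: their accesses interleave in the same cache.
interleaved = [addr for pair in zip(thread_a, thread_b) for addr in pair]
shared = hit_rate(interleaved, CACHE_LINES)

print(f"thread A alone: {alone:.2%} hits, sharing the cache: {shared:.2%} hits")
```

Real caches are set associative and real access patterns are less regular,
so the drop is normally far less dramatic, but the direction is the same.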


Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256

You are currently subscribed to ntdev as: xxxxx@tab.at
To unsubscribe send a blank email to xxxxx@lists.osr.com

Please visit us: www.tab.at www.championsnet.net
www.silverball.com

