interlocked instructions perform better with more processors?

Sorry to bring up this long old thread again, but I am reading this article
from AMD. It talks about “lock-free” programming and CAS, all of which appear
to boil down to using interlocked APIs, i.e. instructions with a LOCK
prefix.
http://ein.designreactor.com:8080/amd_devCentral/articlex.jsp?id=89
It says: “The lock-free implementation should also provide better
performance as your application runs more threads on hardware with
increasing numbers of processor cores”.

I just want to check with this list whether this is really true, because I dug
up what Arlie Davis said about interlocked instructions:
“You’re also not taking into account what happens on all of the other
processors. As part of the bus locking protocol, all of the other
processors need to flush their pipelines. Considering that some modern
processors have instruction pipelines that have 30 or more stages in them,
that’s a huge cost. Multiply 30 times the number of processors (or now,
cores), add in all the cache flushing, and you begin to see where the impact
comes from.”

I want to understand this concept correctly, so my question is: is the statement
from AMD misleading?

/Daniel

Well, if you want to acquire a spinlock, you need *at least* one locked test-and-set anyway. Therefore, all of Arlie’s statements about interlocked operations apply to spinlock acquisition as well; plus, spinlock acquisition has some extra overhead, namely spinning in a loop.
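
Just to illustrate, a minimal sketch with hypothetical names (a real KeAcquireSpinLock does more, e.g. raising IRQL):

#include <windows.h>

typedef struct _MY_SPINLOCK { volatile LONG Locked; } MY_SPINLOCK;

void MySpinAcquire(MY_SPINLOCK *Lock)
{
    /* InterlockedExchange compiles to a LOCK XCHG - the locked
     * test-and-set.  In this naive version *every* iteration of
     * the spin is a locked operation, which is exactly the extra
     * overhead mentioned above. */
    while (InterlockedExchange(&Lock->Locked, 1) != 0)
        ;  /* spin */
}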

In other words, everything depends on the particular situation - if you want to do interlocked operations in a loop, it is better to give up the idea right on the spot. However, as long as you use interlocked operations in appropriate situations, they may, indeed, give you some performance benefit compared to using spinlocks.

Anton Bassov

Thanks, it seems to me that lock-free versions of data structures would
require many more interlocked instructions than their counterparts which use
standard locking mechanisms. In general, the more I learn about sync
mechanisms (such as their disadvantages in certain situations), the more
complicated it becomes to use them correctly. Not so much in the kernel,
where options are limited, but a lot in usermode, where there is much at my
disposal.

/Daniel

Spinlocks require 2 interlocked operations per thread per lock
acquisition/release sequence. Any ‘lockless’ algorithm that uses ~2 per thread is a
win. The difficulty is when the lockless algorithm uses >2 per thread,
as then you have to quantify the cost of interlocked operations and
somehow compare that to the cpu-stalling cost of spinlocks. Generally
the lockless algorithms make this all very difficult by not having a
clear value for ‘number of interlocked operations per thread’, even if
you could come up with a number for the cost of each interlocked
instruction.
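
For instance, here is a sketch of a typical lock-free stack push (hypothetical types; Win32 InterlockedCompareExchangePointer assumed). Uncontended, it costs one interlocked operation per push; every collision adds another:

#include <windows.h>

typedef struct _NODE {
    struct _NODE *Next;
    int Value;
} NODE;

/* Lock-free push: one CAS when uncontended, one more per collision,
 * so "interlocked operations per thread" has no fixed value.  (A
 * matching pop would also need ABA protection, which muddies the
 * count further.) */
void LockFreePush(NODE * volatile *Top, NODE *Node)
{
    NODE *old;
    do {
        old = *Top;
        Node->Next = old;
    } while (InterlockedCompareExchangePointer(
                 (PVOID volatile *)Top, Node, old) != old);
}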


What it should really tell you is that there’s no silver bullet for synchronization performance issues. Either you stall the processors in software or you stall them in hardware, but someplace you have to make them stop.

Analyze your performance issue. If you have many threads (high contention), they just need to set one value (small data to manipulate), and they aren’t already holding another common lock, then interlocks will probably work well - but you don’t know until you measure.

If you have several values that you need to modify atomically, if you have low contention for a lock, or if the work you’re doing is already protected by a higher-level lock, then using locks is probably better. Again, you don’t know for sure until you measure.

The bottom line is that lock-free is an option, but you don’t know if it’s the better option for sure until you measure. And I wouldn’t bother trying it unless you know you have a problem, which again you find out through measurement.

-p


> it seems to me that lock-free versions of data structures would
> require many more interlocked instructions than their counterparts which use
> standard locking mechanisms.

Not necessarily - again, it depends on the particular situation. In general, you should not try to find
some magic solution that is optimal in every situation. Instead, you should judge things on a case-by-case basis and act accordingly - an approach that works fine in situation X may be unreasonable in situation Y, and vice versa…

Anton Bassov

> Spinlocks require 2 interlocked operations per thread per lock acquisition/release sequence.

Actually, the release does not have to be interlocked, because the execution path of a releasing thread
is always the same. This is why the modification of a spinlock’s state upon releasing it can be done with a simple MOV instruction, which cannot even be used with a LOCK prefix. However, when it comes to acquisition, it has to be done as a locked set-and-test in order to make sure that all processors on the machine have a consistent view of the spinlock…

Anton Bassov

Also keep in mind that ‘lock-free’ methods use the ‘lock’ prefix for atomic
instructions. Atomic instructions can be expensive on modern CPUs and
SMP/NUMA systems. Sometimes algorithms that provide mutual exclusion without
any atomic instructions are faster. An example of such an algorithm is Lamport’s
bakery algorithm:

http://en.wikipedia.org/wiki/Lamport’s_bakery_algorithm
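
A sketch of it in C follows (thread count fixed in advance; note that on a modern out-of-order CPU, and with an optimizing compiler, you would still need memory barriers, because the algorithm assumes sequentially consistent loads and stores - which eats into its "no atomic instructions" advantage):

#define N 8                     /* number of participating threads */

volatile int choosing[N];
volatile int number[N];

void bakery_lock(int i)         /* i = this thread's index, 0..N-1 */
{
    int j, max = 0;

    choosing[i] = 1;            /* "I am picking a ticket" */
    for (j = 0; j < N; j++)
        if (number[j] > max)
            max = number[j];
    number[i] = max + 1;        /* take a ticket larger than any seen */
    choosing[i] = 0;

    for (j = 0; j < N; j++) {
        while (choosing[j])     /* wait until thread j has its ticket */
            ;
        /* lower ticket (ties broken by thread index) enters first */
        while (number[j] != 0 &&
               (number[j] < number[i] ||
                (number[j] == number[i] && j < i)))
            ;
    }
}

void bakery_unlock(int i)
{
    number[i] = 0;
}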

Dmitriy Budko
VMware


Thanks, I know I can measure, but I just wanted to understand the reasoning
in the article as to why this would get faster
as the number of processors grows, because in the example given, the queue is
using a lock instruction on every iteration of the loop, whereas a spinlock
would likely be placed outside, around the loop, to guarantee
consistency. Maybe the author has not taken into consideration the price of
bus locking.

I know I can measure if it becomes a problem; the question is what to
use for that. The KeQueryPerformanceCounter API is said to “defeat its purpose
of returning very fine-grained time-stamp information”. There are also APIs
like KeQueryRuntimeThread or GetThreadTimes, but I learnt from this list that
these report distorted information because the counters are not updated at every
context switch, which means that if a thread runs and yields immediately after
that, it is still reported to have used up its entire timeslice. This is why
Task Manager won’t do; I will need something which uses a more accurate
way of measuring, such as the new QueryThreadCycleTime API. The question is,
what tool makes use of that?

There are tools like VTune which do instrumentation, but I am not sure how
exactly they work. Also, obtaining the best “performance” is not always the
ultimate goal. An example is a situation where you have a real-time
multimedia thread which cannot afford a sleep/wait or a usermode/kernelmode
switch. You want to avoid hiccups at any price; that’s the only thing
that really matters, and it does not necessarily reflect the best overall
performance or CPU consumption.
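
For reference, calling it is trivial; a user-mode sketch (Vista or later):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG64 start, end;

    /* Cycles are charged to *this thread only*, so a context switch
     * in the middle does not distort the result the way
     * GetThreadTimes can. */
    QueryThreadCycleTime(GetCurrentThread(), &start);

    /* ... code under measurement ... */

    QueryThreadCycleTime(GetCurrentThread(), &end);
    printf("cycles: %I64u\n", end - start);
    return 0;
}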

/Daniel


> An example is a situation where you have a real time multimedia thread which
> cannot afford a sleep/wait or a usermode/kernelmode switch.

Such threads don’t exist…

As long as a thread runs at IRQL below DPC level (and all user-mode code runs below DPC level),
it is always subject to context switches, regardless of thread priority. For example, imagine what happens if it causes a page fault - since page faults are processed synchronously, it has to wait for the completion of the IRP that the Memory Manager sends to the FSD, so it will yield the CPU…

Anton Bassov


Well, they get scheduled and complete before their time slices are used up.
They are subject to context switches only in a theoretical sense, in case
there are multiple threads like these.

/Daniel

> Well, they get scheduled and complete before their time slices are used up.

…which still means that your target thread has to wait AT LEAST one clock tick (under W2K and XP
a thread’s time slice is used up in 2 clock ticks by default, because the default quantum value under these OS versions is 6).

> They are subject to context switches only in a theoretical sense, in case there are multiple
> threads like these.

It has nothing to do with the number of threads of the same priority - as long as a higher-priority thread is in the waiting state, lower-priority ones are allowed to run until the higher-priority thread’s wait is satisfied…

Anton Bassov

If you use test-and-test-and-set you don’t lock the bus except once per
spin. The technique is to loop on a pure test (not test-and-set), which only
reads the variable and doesn’t place undue load on the bus. Only when the test
succeeds do you execute a test-and-set instruction, which locks the bus once for the
one write. You can read about that technique in Gregory Andrews’s OS book.
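
In C the structure looks something like this (a sketch, hypothetical names):

#include <windows.h>

typedef struct _MY_SPINLOCK { volatile LONG Locked; } MY_SPINLOCK;

void MySpinAcquireTTAS(MY_SPINLOCK *Lock)
{
    for (;;) {
        /* Pure test: plain reads, no LOCK prefix.  Each CPU spins on
         * its own cached copy until the owner's releasing write
         * invalidates it - no undue bus load. */
        while (Lock->Locked != 0)
            ;
        /* The lock looked free: pay for the one locked test-and-set. */
        if (InterlockedExchange(&Lock->Locked, 1) == 0)
            return;             /* acquired */
        /* Someone beat us to it - back to the pure test. */
    }
}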

As far as VTune goes, AFAIK it works by sampling: a thread interrupts every timer
tick and looks at the state of the CPUs, accumulating statistics as time
goes by. But I haven’t touched it for a while, so I may be wrong!

Alberto.


> If you use test-and-test-and-set you don’t lock the bus except once per
> spin. The technique is to loop on a pure test (not test-and-set), which only
> reads the variable and doesn’t place undue load on the bus. Only when the test
> succeeds do you execute a test-and-set instruction, which locks the bus once for the
> one write.

Well, making the *polling* loop interlocked as well would be really stupid. If you do something like that, then the overhead of bus locking will, indeed, become quite noticeable. However, as long as you use the interlocked operation only in the outer loop, the overhead is just negligible…

> You can read about that technique in Gregory Andrews’s OS book.

Or, even better, just disassemble KfAcquireSpinLock() - this is exactly how it works. In fact, it can still be improved - KfAcquireSpinLock() does a locked set-and-test *before* proceeding to the inner loop. Therefore, if someone holds the spinlock that you are trying to acquire, KfAcquireSpinLock() does one unnecessary (in my opinion) locked set-and-test. I would rather check the spinlock state before
doing the set-and-test. If you do it this way, you are quite unlikely to do more than one interlocked operation per spinlock acquisition. I know this improvement seems marginal, but I believe in some cases it is not. Imagine that the target machine has, say, 8 CPUs, and the spinlock in question is in high demand (for example, the one that protects the dispatcher database), so that you are quite unlikely to acquire it on the very first attempt. In this case KfAcquireSpinLock() may do quite a few unnecessary, in the long run, interlocked operations that could easily be avoided by doing a “regular” check before proceeding to the interlocked set-and-test…
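
In pseudo-C, the ordering I am describing looks like this (a sketch, reusing the hypothetical lock type from the earlier posts; the test-first variant I would prefer is exactly the test-and-test-and-set sketch above):

typedef struct _MY_SPINLOCK { volatile LONG Locked; } MY_SPINLOCK;  /* as above */

void AcquireLockedFirst(MY_SPINLOCK *Lock)
{
    /* The locked set-and-test comes *before* any plain-read check,
     * so when the lock is already held, this first interlocked
     * operation is the wasted one. */
    while (InterlockedExchange(&Lock->Locked, 1) != 0) {
        while (Lock->Locked != 0)   /* inner loop: plain reads only */
            ;
    }
}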

Anton Bassov

This doesn’t have much to do with locking, but with the particular
characteristic of the test-and-set operation of writing the bit every time
the instruction gets executed. The lock prefix per se doesn’t do much
performance damage, unless your locked operation is repeated over and over again,
and even then, read cycles are less expensive than memory writes, because
memory writes will cause other processors to invalidate their own copies of
the written cache line, which will lead to additional traffic on the bus
when those processors try to read the information again. Picture yourself on
a Dell 490 with four hyperthreaded processors putting the equivalent of
eight processors on the front-side bus, all spinning at 3GHz on a
test-and-set instruction over an 800MHz front-side bus, and you’ll get the
right picture.

As far as disassembling goes, my advice is: don’t. I find it way better to rely
on one’s own wits. I have been around for a while, and the only time I ever
disassembled Windows code was when I needed to hook it - and even then, I
only disassembled the first few instructions, and only for the sake of the
instructions themselves, not for what they were doing!

Actually, I trust my own hand better than anyone else’s, even if that person
has been one of the writers of Windows code.

Alberto.


> read cycles are less expensive than memory writes

Apropos, and off topic (I asked this before; maybe this time someone who knows is around…):

In Intel’s manuals I found an explicit statement that memory reads are atomic
(in my previous post I included the link).

Has anyone seen anything like that in *AMD’s own* writing about AMD CPUs?


> The lock prefix per se doesn’t do much
> performance damage, unless your locked operation is repeated over and over again,

This is why I said that my proposed improvement seems marginal (in fact, in most cases it is).
However, again, imagine a spinlock that protects the dispatcher database - this spinlock gets accessed upon any state modification of any dispatcher object, i.e. upon every context switch, every wait, every modification of event state, and so on. On a busy system with, say, 8 CPUs it is going to be almost always held by someone at any particular moment, and, while it is being held, a few attempts to acquire it are more than likely to be made by other CPUs. More on it below.

> and even then, read cycles are less expensive than memory writes, because
> memory writes will cause other processors to invalidate their own copies of
> the written cache line, which will lead to additional traffic on the bus
> when those processors try to read the information again.

I am afraid you confuse bus locking with atomicity. There is no such thing as an interlocked read or an interlocked write - the LOCK prefix can be used only with certain instructions like INC, DEC, CMPXCHG, BTS, etc., and all these instructions result in a modification of the target variable’s state
(in fact, the very concept of an interlocked operation that does not modify the state of the target variable does not make sense in itself). Therefore, every interlocked operation that is made by CPU X on some variable is going to invalidate the cache lines of all CPUs that attempt to read the state of this variable. This is the only reason why I said that if a spinlock is always in high demand, my proposed modification may have some overall performance benefit, although in the vast majority of cases this improvement is just negligible…

> As far as disassembling goes, my advice is: don’t. I find it way better to rely on one’s own wits.

> Actually, I trust my own hand better than anyone else’s, even if that person
> has been one of the writers of Windows code.

You should realize that every piece of ntoskrnl.exe’s and HAL.DLL’s code is based upon the collective research of numerous people, and not only at MSFT. If you disassemble this code, you are going to understand not only *what* it does, but also *why* it does things this way. Therefore, I believe disassembling the Windows kernel is just a wonderful educational experience
that helps you better understand the underlying concepts. At the same time, I see what you mean -
indeed, you should not take it as something absolute…

Anton Bassov

Well, just to quibble: XCHG does an effective locked read/write operation, and you can do an effective locked read as well, although why one would want to do a locked read is indeed a mystery.
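
For example, a sketch of an effective locked read via compare-exchange - the exchange value equals the comparand, so the destination never actually changes, but the current contents come back atomically (this is one way to read a 64-bit value atomically on 32-bit x86):

#include <windows.h>

LONGLONG AtomicRead64(volatile LONGLONG *p)
{
    /* LOCK CMPXCHG8B under the hood: if *p == 0 it "replaces" it
     * with 0, otherwise it modifies nothing - either way the return
     * value is an atomic snapshot of *p. */
    return InterlockedCompareExchange64(p, 0, 0);
}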


The lock prefix “atomicizes” the instruction by locking the front-side bus
between the read and the write accesses. Lock prefixes do not operate at
the processor level or at the memory level; they operate on the front-side bus.
Instructions like INC and DEC rely on a read-modify-write sequence, which
involves two bus cycles: a read and a write. If you don’t use the lock
prefix, one or more bus cycles can slip in between the read and the write,
with the potential for a race condition. Again, I recommend Mindshare’s
book; even though it’s written for hw dudes, it’s a great resource!
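
A quick sketch of the race (hypothetical counter, two threads running concurrently):

#include <windows.h>

volatile LONG g_counter = 0;

/* Without LOCK: "g_counter++" is a read bus cycle, an add, and a
 * write bus cycle.  Another CPU's cycles can land in between, and
 * increments get lost. */
DWORD WINAPI RacyWorker(LPVOID arg)
{
    int i;
    UNREFERENCED_PARAMETER(arg);
    for (i = 0; i < 1000000; i++)
        g_counter++;
    return 0;
}

/* With LOCK (InterlockedIncrement): the bus is locked across the
 * read-modify-write, so nothing can slip in between. */
DWORD WINAPI SafeWorker(LPVOID arg)
{
    int i;
    UNREFERENCED_PARAMETER(arg);
    for (i = 0; i < 1000000; i++)
        InterlockedIncrement(&g_counter);
    return 0;
}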

Actually, as an aside, I heard that there are buses out there that enable a
processor to ship a modified cache line directly to a read requestor: no
need to even access memory. But I don’t have specific data or reliable
information to share.

Alberto.

----- Original Message -----
From:
To: “Windows System Software Devs Interest List”
Sent: Thursday, October 25, 2007 4:14 PM
Subject: RE:[ntdev] interlocked instructions perform better with more
processors ?

>> The lock prefix per-se doesn’t do much
>> performance damage, unless your locked operation is repeated over and
>> again,
>
> This is why I said that my proposed improvement seems to be marginal (in
> fact, in most cases it is).
> However, again, imagine a spinlock that protects a dispatcher database -
> this spinlock gets accessed upon any state modification of any dispatcher
> object, i.e. upon every context switch, every wait, every modification of
> event state and so on. On a busy system with, say, 8 CPUs it is going to
> be almost always held by someone at any particular moment, and, while it
> is being help, few attempts to acquire it are more than likely to be made
> by other CPUs. More on it below
>
>
>> and even then, read cycles are less expensive than memory writes,because
>> memory writes will cause other processors to invalidate their own copies
>> of
>> the written cache line, which will lead to additional traffic on the bus
>> when those processors try to read the information again.
>
>
> I am afraid you confuse bus locking with atomicity. There is no such thing
> as interlocked read or interlocked write - LOCK prefix can be used only
> with some certain intruction like INC, DEC, CMPXCGH, BST, etc, and all
> these instructions result in modification of the target variable’s state
> (in fact, the very concept of interlocked operation that does not modify
> state of the target variable just does not make sense it itself).
> Therefore, every interlocked operation that is made by CPU X on some
> variable is going to invalidate cache lines of all CPUs that attempt to
> read the state of this variable. This is the only reason why I said that
> if spinlock is always in high demand, my proposed modification may have
> some overall performance benefits, although in the vast majority of cases
> this improvement is just negligible…
>
>> As far as disassembling, my advise is, don’t. I find it way better to
>> rely on one’s own wits.
> …
>
>> Actually, I trust my own hand better than anyone else’s, even if that
>> person
>> has been one of the writers of Windows code.
>
>
> You should realize that every piece of ntoskrnl.exe’s and HAL.DLL’s code
> is based upon the collective research that was made by numerous people,
> and not only at MSFT. If you disassemble this code, you are going to
> understand not only what it does, but also why it does things this
> way. Therefore, I believe disassembling Windows kernel is just a
> wonderfull educational experience
> that helps you better understand the underlying concepts. At the same time
> I see what you mean -
> indeed, you should not take it as something absolute…
>
> Anton Bassov
>
>
>
>
>
> —
> NTDEV is sponsored by OSR
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer

Alberto,

> … ship a modified cache line directly to a read requestor …

Is that another way to say “snooping”, where the cache simply watches for
write operations on the (shared) memory bus and proactively updates cache
lines? I thought that was a common feature of current platforms
(perhaps excluding NUMA and other exotics).

I sure don’t need to know this, but I am just trying to keep up playing the
home game, and I was curious.

-dave
