Re: Problem using queued spinlocks on single cpu computer

> ----------

From: xxxxx@compuware.com [SMTP:xxxxx@compuware.com]
Reply To: xxxxx@lists.osr.com
Sent: Saturday, September 27, 2003 12:00 AM
To: xxxxx@lists.osr.com
Subject: [ntdev] Re: Problem using queued spinlocks on single cpu computer

There are ways to handle one structure per acquisition that don’t require a
new object per acquisition.

Sure. The hard way I mentioned, for example. Show me one which is as easy
and at least as efficient as using a new object on the stack.

As for code such as

SpinLock.Lock();

if (something_went_wrong) {
throw Error;
}
SpinLock.Unlock();

whoever uses it is asking for it, no ? Yuk. That’s not what critical
sections were invented for !

Sure. This was only an example meant to show the problem in a blatantly
obvious way. You ignored the next one and the following text. It may be my
English, or you just don’t want to understand. In any case, I give up.

Best regards,

Michal Vodicka
STMicroelectronics Design and Application s.r.o.
[michal.vodicka@st.com, http://www.st.com]

Alberto,

Sorry if I annoyed you, my apologies.

OK, I wasn’t too annoyed; I just felt you didn’t want to hear what I and
others were trying to say.

I now understand that each processor must have its own lock structure, but
that’s easily handled. You could try for example something like

struct queuelocks
{
protected:
    static KLOCK_QUEUE_HANDLE ql[MaxNumberOfProcessors];
public:
    queuelocks() { }
};

struct spinlock : public queuelocks
{
protected:
    KSPIN_LOCK lk;
public:
    spinlock() : queuelocks() { }
    void Acquire() {
        KeAcquireInStackQueuedSpinLock(&lk, &ql[KeGetCurrentProcessorNumber()]); }
    void Release() {
        KeReleaseInStackQueuedSpinLock(&ql[KeGetCurrentProcessorNumber()]); }
};

There is a race condition: a small window between
KeGetCurrentProcessorNumber() and the real spinlock acquire. If the above
code is called at PASSIVE_LEVEL, the thread can be rescheduled on another
CPU in the meantime. As Max pointed out before, you’d have to raise IRQL to
DISPATCH_LEVEL to make it safe.

Next complaint: it depends on the current implementation; from the docs it
seems there should be one handle per acquire, not per CPU. What is
MaxNumberOfProcessors? If you set it to, say, 64, you waste space most of
the time and the code could still fail in the future. You can allocate it
dynamically depending on real conditions, which makes the code more
complicated. Compare it with the stack acquirer object; the only thing
necessary is to decrease ESP, and in most cases the instruction is already
there and only a constant changes. Elegant and efficient.

It is of course possible to use code similar to the above, but it is always
more complicated, less efficient, and error prone. And for what? OOP purity?
BTW, how do you like, for example, STL functors?

Using the spinlock is as simple as s.Acquire() and s.Release(); all the rest
of the work is done once at init time, and the code is localized in one
place:

s.Acquire();
{
// critical section goes here…
}
s.Release();

Exception unsafe… ;)

This code also gets rid of those dynamic allocations that people objected
to. And if you don’t like that static variable, you can remove it at the
cost of grabbing additional memory for each spinlock, or you can compromise
by having an array of PKLOCK_QUEUE_HANDLEs pointing to structures created by
the constructor or even allocated at DriverEntry time.

Yes, but an on-stack object per acquire still consumes less memory.

Another advantage is that you can use it in multithreaded situations, for
example, a producer-consumer:

Thread A()
{
e.Acquire();
{
Produce();
}
f.Release();
}

Thread B()
{
f.Acquire();
{
Consume();
}
e.Release();
}

I believe only semaphores can be used that way. Imagine what happens if the
producer runs before the consumer and spinlock f, which wasn’t acquired, is
released. I don’t see how a pure spinlock (without an additional counter)
could be used for a producer-consumer solution.

The above code has another advantage: I can quickly replace Acquire( ) and
Release( ) with machine code, or I can derive such an object, for example,

struct hardlock : public spinlock
{
    void Acquire() { TestAndSetAssemblerMacroGoesHere(); }
    void Release() { SetSpinLockVariableToZeroByHand(); }
};

which may be required at debug time when the OS is out to lunch. Moreover, I
could even move Acquire( ) and Release( ) to the superclass, for example,

struct queuelocks
{
protected:
    static KLOCK_QUEUE_HANDLE ql[MaxNumberOfProcessors];
    KSPIN_LOCK s;
public:
    queuelocks() { }
    virtual void Acquire() {
        KeAcquireInStackQueuedSpinLock(&s, &ql[KeGetCurrentProcessorNumber()]); }
    virtual void Release() {
        KeReleaseInStackQueuedSpinLock(&ql[KeGetCurrentProcessorNumber()]); }
};

And now I have the option to override the Acquire and Release methods with
my own code (for example, a test, test and set), or with calls to plain
vanilla KeAcquireSpinLock( ) and to KeReleaseSpinLock( ): I can derive
multiple kinds of spinlocks from that one superclass. I can add common
debugging code to the queuelocks class too.

I don’t see a difference from the acquirer class. You can change it any way
necessary, too, without changing the code which uses it.

There are umpteen variations on this theme. But of course, caveat emptor, I
didn’t try the code in a real machine, so I don’t know if the compiler will
generate good kernel-side code. The assumption of course is that you only
need one KLOCK_QUEUE_HANDLE per processor, meaning, a processor will never
be waiting on more than one spinlock.

Maybe I reacted too strongly, my apologies, but I don’t like things such as
throwing exceptions from inside a critical section; I believe one should
release the spinlock first and then go deal with exception conditions.

Let’s forget about spinlocks for a while and think about any synchronization
object, such as events, critical sections, mutexes and semaphores. There are
two ways to handle error conditions in C++: classic status codes and C++
exceptions. If you decide on exceptions, any function, including operators,
can throw an exception, and you can’t assume code inside a critical section
behaves differently. Throwing an exception directly can seem silly, but
generally you can’t avoid an indirect throw from a function or operator
called. Actually, it is perfectly reasonable under normal circumstances
(where a spinlock may not be involved). It is necessary to be prepared for
this situation, and that’s what autolocking classes were designed for. I
don’t want to discuss the pros and cons of C++ exceptions; the list has a
different purpose and I don’t feel quite competent for it. They just enforce
some rules which may not be ignored; exception-safe class design is one.
Spinlocks enforce other rules and the two may not mix well. Personally, I
like C++ exceptions in user mode but don’t think they are appropriate for
kernel use (no flames, please :). Every environment has specific
requirements and it is necessary to choose tools and design code
accordingly.

Best regards,

Michal Vodicka
STMicroelectronics Design and Application s.r.o.
[michal.vodicka@st.com, http://www.st.com]

> ----------

From: xxxxx@compuware.com [SMTP:xxxxx@compuware.com]
Reply To: xxxxx@lists.osr.com
Sent: Monday, September 29, 2003 9:06 PM
To: xxxxx@lists.osr.com
Subject: [ntdev] Re: Problem using queued spinlocks on single cpu computer

I did indeed ignore a few implementation details, but I feel the IRQL race
can be handled; worst comes to worst, a cli/sti pair does wonders to prevent
preemption.

Do you mean:

__asm pushfd
__asm cli
KeAcquireInStackQueuedSpinLock(&lk, &ql[KeGetCurrentProcessorNumber()]);
__asm popfd

Horrible, isn’t it? The problem is that you have to prevent preemption
around the whole acquire call, because the code between handle selection and
the actual lock acquire has to be protected. No, raising to DISPATCH_LEVEL
is the only correct solution, with all the negative consequences on
performance (and where do you store the original IRQL? :)

As far as the one handle per acquire, I’m not sure I like the idea of having
more than one acquire per CPU, although, to be fair, that restriction would
break my producer-consumer example ! In my code, MaxNumberOfProcessors is an
upper limit on the size of your SMP - say, 16, 32, 64. It’s a constant, and
the space wasted isn’t that bad if your data structure is small enough.

My point was that you don’t know this size at compilation time for a general
purpose driver. And even if you use the maximum possible, it can change in
the future. Yes, the space waste isn’t too big, but we need to compare it
with the stack object solution.

Or you can ask the OS to get it for you and set
your data structure up accordingly at construction time.

Yes, I mentioned it. Again, compare the complexity with the stack-only
solution.

But I was thinking, this whole thing is a bit academic. How about this for a
barebones queued spinlock,

class qlock
{
private:
    unsigned int curr, next;
public:
    qlock() { curr = 0; next = 0; }
    // take your number, wait till you're called
    void Acquire() { unsigned int n = xAdd(next,1); while (n != curr); }
    // one more customer happily served, next!
    void Release() { xAdd(curr,1); }
};

The n=xAdd(a,1) construct is equivalent to a hw-level locked, indivisible
exchange-and-add. This code is a straightforward application of the Bakery
Algorithm, the queue only exists in our minds, and it wraps around at
0xffffffff. You can tweak it to serve whatever you need ! And, what’s an
advantage to me, it works even if the OS’s down; besides, it takes less code
to implement than other solutions that use the OS call. And look, ma, no
handles !
OK, but now we’re far from the original problem. BTW, it would deadlock at
DISPATCH_LEVEL on a single CPU.

> As far as the producer-consumer, if you have a spinlock class, it’s a
> simple matter to add a counter to it. I was just using the
> producer-consumer as an example of something you may want to control
> without necessarily using an on-stack alloc and release.
>
I believe a spinlock isn’t a good tool for a producer-consumer solution. But
as you said, it is a bit academic.

> One point in the code I’ve written this last time was to make sure that
> everything that requires object or memory management is done at init time.
> Again, it keeps all the spinlock maintenance in the same place, and it
> avoids allocations and deallocations while running.
>
The necessary memory management is one instruction, i.e. sub esp,8. And the
whole thing requires one line, i.e. the acquirer variable declaration,
contrary to two lines with your spinlock (lock, unlock).

> As for exceptions, I believe there’s a difference between spinlocks and
> other synchronization objects, in that in a spinlock a processor is
> looping and if we don’t handle things well, that processor’s going to
> hang. It may be a question of personal preference, but I don’t like to do
> anything inside a spinlock but the bare minimum I can get away with. I
> like code structured in layers, like an onion, and I like to see spinlock
> protected code at the very bottom !
>
OK, I’d agree spinlocks are a special case. Exception safety was only one of
the acquirer advantages mentioned, and it is a general design principle (or
design pattern) for critical section locks. It always needs a two-class
solution: the lock class itself and an acquirer-like class. For me it makes
sense to design the spinlock class the same way even if the exception safety
advantage isn’t used.

Again: it is possible to design it the way you suggest, but it is
unnecessarily complicated, error prone, and has worse performance than the
two-class design. Look again at the code Chuck or Max posted at the
beginning and compare. The simplest design is usually the best one.

Best regards,

Michal Vodicka
STMicroelectronics Design and Application s.r.o.
[michal.vodicka@st.com, http://www.st.com]

> ----------

From: xxxxx@compuware.com [SMTP:xxxxx@compuware.com]
Reply To: xxxxx@lists.osr.com
Sent: Tuesday, September 30, 2003 4:42 PM
To: xxxxx@lists.osr.com
Subject: [ntdev] Re: Problem using queued spinlocks on single cpu computer

I actually meant
pushf/cli/n=KeGetCurrentProcessorNumber/popf/KeAcquireInStackQueuedSpinLock(&lk,&ql[n]),
but now I see that the window may not disappear. Yet what burns more cycles,
a quick spin or a context switch ? I don’t know if the increase in I/O
reactivity, which might be the natural consequence of allowing that kind of
preemption, pays off the additional aggravation and complexity in managing
the i/o complex. The way to speed up an interrupt system isn’t to make it
more reactive, but to lower the number of interrupts and the latency of
handling them.

I don’t understand. The necessity to protect against a context switch is
because of preemptive multitasking; I don’t see the relation to I/O
reactivity. Well, the probability that you’re preempted at a bad moment
changes, but it has to be handled the same way even if the probability is
very low. Am I missing something?

I don’t know, Michal, I don’t know. What I find horrible is the fact that I
can’t trust my state even when I’m doing kernel work. Is that good, safe,
efficient ? What do I gain by allowing an OS service to be preempted
arbitrarily to the point I can’t even trust my processor number ?

Normally, you shouldn’t need to know the processor number. Code running at
PASSIVE_LEVEL is preemptible and I see it as an advantage (kernel threads).
What is the alternative? Cooperative multitasking in the kernel? You can
always raise IRQL to DISPATCH_LEVEL to avoid preemption.

The difference is at what time we do memory management. I tried to design it
in a way that all the maintenance is done at construct time, so that the
Acquire() call ends up generating inline code that does little more than
loop on the lock.

OK, generally that is better, but in this case it causes unnecessary
complications. Acquirer class construction plus lock acquire is very simple
and should also generate inline code.

The original problem was to achieve a queued spinlock. With one CPU, you
call Acquire(), it bumps the tail, you then get into the critical section
because now head==tail, then you call Release() which bumps the head. And
you could always bypass that code for the one-cpu case. If I remember, I
staggered the head and the tail to take that possibility into account, but
again, I’m losing my ability to handle detail. As for IRQL, fine, raise the
IRQL first, but then I still believe that we’d be better off if we just
turned off interrupts in that processor during the spin. I don’t at all mind
a solution that says pushf/cli/spin/popf, it’d save a lot of aggravation.

Disable interrupts just because some code waits for a lock? It affects the
whole system because of local conditions. Do you really think it is a good
idea?

>The necessary memory management is one instruction, i.e. sub esp,8. And the
>whole thing requires one line, i.e. the acquirer variable declaration,
>contrary to two lines with your spinlock (lock, unlock).

I’m not sure I follow you here, you have to Acquire and you have to Release:
two lines.

I mean the acquirer class, which releases automatically. Yes, the generated
code would be very similar, but the C++ source is simpler.

A strong point of using OO is that we’re now able to code in ways that are
not accessible to plain vanilla block structured languages!

It doesn’t mean it must be used in cases where it causes unnecessary
complications. The goal of OOP is to simplify programming and make code more
readable, not to give developers headaches in the name of purity.

I don’t see why one well-encapsulated class is any more error prone or more
complicated than anything else, and the complication if any comes from the
semantics of the call itself and of the OS, and not from the inherent
problem !

Yes, that is the case here. The OS enforces some rules, and the class design
has to fulfil them even if you don’t like it.

I also have a problem with implicit actions such as the combination of
Release-on-destruct and destruct-on-exit, which leads to an implicit
Release-on-exit: this may be convenient and easy to code, but I believe
spinlocks are important enough that we should explicitly release them, if
nothing else in consideration of the guys who will eventually inherit the
code.

This depends on personal preferences. Automatic locks are widely used and
shouldn’t confuse experienced C++ developers. On the contrary, the necessity
to manually unlock can be confusing to somebody who routinely uses
autolocks.

I also don’t like to use two objects here, and I find the limitation that
spinlocks must be acquired and released within the same stack frame, well, a
limitation.

You may not like it, but it is the reality. They were designed this way,
which means using them in other ways is complicated.

My approach has to do with the way I see spinlocks: a hardware level
mechanism that should be at the very bottom of the OS layering, that should
not have to know about any other OS mechanisms, including memory management
and exception throwing. A mechanism that should be used in a tight setting:
loop on the lock, do your job, release the lock, and make sure your job is
as small as possible. A fine grain tool, not a coarse grain one !

I’d agree. With these requirements it makes sense to use the stack frame,
doesn’t it?

Hmm, guess why the queued spinlock acquire/release routines have …InStack…
in their names. Doesn’t it imply the intended usage?
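For reference, the documented pattern keeps the KLOCK_QUEUE_HANDLE as a local in the acquiring function’s stack frame, one per acquisition. A minimal, uncompiled sketch against the WDK API (the function name and comments are illustrative):

```c
/* Sketch of the documented in-stack queued spinlock pattern (WDK).
   The KLOCK_QUEUE_HANDLE is a local: one handle per acquisition, no
   per-CPU arrays, and no processor-number race to worry about. */
VOID TouchSharedState(PKSPIN_LOCK Lock)
{
    KLOCK_QUEUE_HANDLE handle;          /* lives in this stack frame */

    KeAcquireInStackQueuedSpinLock(Lock, &handle);
    /* ... critical section: touch the shared state ... */
    KeReleaseInStackQueuedSpinLock(&handle);
}
```

Note that KeReleaseInStackQueuedSpinLock takes only the handle; the handle remembers which lock it queued on.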

Best regards,

Michal Vodicka
STMicroelectronics Design and Application s.r.o.
[michal.vodicka@st.com, http://www.st.com]

> ----------

From: xxxxx@compuware.com [SMTP:xxxxx@compuware.com]
Reply To: xxxxx@lists.osr.com
Sent: Wednesday, October 01, 2003 12:32 AM
To: xxxxx@lists.osr.com
Subject: [ntdev] Re: Problem using queued spinlocks on single cpu computer

[snip]

But the issue is not preemption, the issue’s interruption. Apart from the
potential undesirability of letting a time-slice context switch take place
inside a spinlock loop - that’s about preemption all right - we also have
the potential undesirability of allowing interrupts within a spinlocked
critical section. The “raise the IRQL” trick sits on the assumption that,
well, if my interrupt routine is running at IRQL level umpteen and
busywaits, no other routine at a higher IRQL will busywait on the same piece
of memory, or a deadlock may occur. So, I don’t know if I’m at passive level
or at dispatch level, but in either case a spinlock should be protected from
preemption, although not necessarily from interruption.

Ah, I see. You don’t like the way the OS is implemented now. Of course,
there are disadvantages to what you propose, but also advantages. To
simplify things we can see it as three levels: passive, which allows
preemption; dispatch, which allows interruption; and interrupt level. If I
understand correctly, you claim two levels should be enough. I’d see it as a
return to the windos world.

>Disable interrupts just because some code waits for a lock? It affects the
>whole system because of local conditions. Do you really think it is a good
>idea?

Disabling interrupts only affects that one CPU anyway, and spinlocks are
supposed to be very fine grained ! Note well, raising the IRQL is a form of
disabling interrupts, it’s a partial interrupt disable. The difference
between raising the IRQL and issuing a CLI here is basically how much I/O
interruption we’re willing to allow that one CPU to be subjected to, and
again, the issue becomes one of reactivity: what do I really gain by
allowing that to happen ? If we were running on a marvellously fast bus with
an immense i/o bandwidth, well, maybe, but we’re running on a 133MHz bus.
Take a 3GHz machine, and that one CPU can run over 20 instructions during
the time it takes for the PCI bus to perform one interrupt cycle. I’d be
curious to figure out how many instructions a 3GHz Xeon CPU can execute
during the average PCI bus interrupt latency; that’s going to be a measure
of how many interrupts we’d possibly lose by allowing a 50-instruction
critical section, say, to run on a CPU with interrupts disabled.

Makes sense. However, when NT was designed, the 486 (or 386?) was the
fastest PC available and the situation was different. Even if there are very
good reasons for removing IRQLs, I believe nobody would dare to do it now.

So, I’m not sure I’m worried about affecting the whole system, unless of
course one abuses the concept of a spinlocked critical section.

And that’s a problem, because then OS stability depends on average driver
quality. Several years ago I used NT4 on a PC with a SCSI adapter from one
big and well known company. There were no apparent problems with the adapter
driver (if I ignore one BSOD per month during boot) running on the free
build. Things were different with the checked build. Usually, an assertion
failure was raised during this driver’s initialization; something like
SpinLockSpinningTooLong, and if I remember correctly, it meant a spin longer
than 0.5 sec. The next example is NDIS and serialized miniports. Driver
handlers are called with a spinlock held, and processing involves at least
several function calls, can include a whole packet copy and maybe encryption
(original serialized IM drivers). It would be a real problem if interruption
weren’t allowed.

OO gives us new programming paradigms. I shouldn’t be programming C++ in C
style any more than I should be programming C in assembler style ! I daresay
that C++ should simplify programming for OO programmers, and it should make
code more readable to OO programmers !

OK, but here you have to deal with a C API and encapsulate it in C++
classes. Even if you believe that is a good idea, the class design is
influenced by the API’s idiosyncrasies. It is better to use the easiest
possibility and not make complicated workarounds.

I’m not complaining about the rules, I’m complaining about the loose
semantics. For example, the hw architecture doesn’t allow an interrupt
between a move to esp and a move to ss. Maybe the OS should not allow a task
to be rescheduled to another processor between the time one reads the
processor number and the time one uses it ?

I don’t see how it could be implemented. Using a special compiler construct?
You’d be the first one complaining ;)

You know, we used to have automatic typing in Fortran, and that was turned
down in favor of explicit typing. And it was done for a reason !

Two quite different things. Or do you mean everything should be explicit?
What about destructors? Back to Turbo Pascal, where destructors had to be
called explicitly, and yes, there were advantages.

Synchronization code, likewise, will benefit from explicitly acquiring and
freeing those synchronization constructs. But you may be right, it could be
a question of personal preference.

There is always a choice. It is possible to design a class which allows both
implicit and explicit release.

>I’d agree. With these requirements it makes sense to use the stack frame,
>doesn’t it?

Not necessarily: these things don’t need to be in the same thread.

Which would violate your requirement for a small critical section ;)

Sorry for being brief; I guess the debate is too academic now and should
stop sometime. Anyway, it had one really good effect for me: it forced me to
examine how queued spinlocks are really implemented, and being too lazy to
start IDA I used SoftICE and found such strange code that I eventually
noticed part of the symbols loaded were obsolete (kicking head and swearing
:). This explains some debugging anomalies I encountered in past weeks.
Everything started to work well with correct symbols, of course… An idea for
SI: warn about wrong symbols as WinDbg does. I mean symbols loaded using
winice.dat, because it is probably hard to make a similarly stupid error
with dynamically loaded ones. I updated the OS and forgot about statically
loaded NMSs before…

Best regards,

Michal Vodicka
STMicroelectronics Design and Application s.r.o.
[michal.vodicka@st.com, http://www.st.com]

Yes, I realize that. Thanks to you all, I’ve learned a few more things about
the NT architecture that I wasn’t quite aware of !

Alberto.

-----Original Message-----
From: Maxim S. Shatskih [mailto:xxxxx@storagecraft.com]
Sent: Tuesday, September 30, 2003 6:25 PM
To: Windows System Software Devs Interest List
Subject: [ntdev] Re: Problem using queued spinlocks on single cpu computer

KeAcquireInStackQueuedSpinLock(&lk, &ql[KeGetCurrentProcessorNumber()]); }
Release( ) {
KeReleaseInStackQueuedSpinLock(&ql[KeGetCurrentProcessorNumber()]); }

Don’t forget to raise to DISPATCH_LEVEL, or KeGetCurrentProcessorNumber will
be invalid.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

