Kernel-mode stack space (spinoff of "ZwReadFile Speed Problem")

I gave this a new topic because it’s pretty OT and I don’t want to
clutter up the original thread with it.
Quotes are from the “ZwReadFile Speed Problem” thread.
Hope you guys don’t mind.

For non-involved readers: I suggested a recursive sort (quicksort)
and Steve replied recursive sorts should be avoided in the kernel.


Hi Steve,

> > Hi Don!

> That was me (Steve), actually; don’t blame Don for my rants. :slight_smile:

well, er, I’m sorry for that mistake - hope it doesn’t happen again!
(Messed up 2 messages I was answering to)

> > (…)
> It’s still better to flatten your sort out into a for loop or
> something. You leave yourself open to security problems if someone
> feeds you a big buffer to sort - it’s a DoS at least.

I meant this for really known sizes, like when the driver is queuing stuff
in a self-allocated list/array (that has a known & fixed size).
For unknown sizes one can always check the size and decide what
sort-function to use.
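
For what it's worth, here is a minimal sketch of what "flattening" a quicksort can look like in plain C. It is not code from this thread: the name flat_quicksort and the int element type are made up for illustration. Instead of recursing, it keeps pending ranges in a small fixed-size array and always continues with the smaller partition, so that array never needs more than about log2(n) entries however large the input is. The assert() stands in for a driver's ASSERT().

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* One entry per bit of size_t is enough: the range being worked on
   at least halves every time a range is deferred. */
#define SORT_MAX_PENDING (sizeof(size_t) * CHAR_BIT)

static void swap_int(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Hypothetical helper: sorts v[0..n) ascending without recursion. */
void flat_quicksort(int *v, size_t n)
{
    struct { size_t lo, hi; } pending[SORT_MAX_PENDING]; /* hi is exclusive */
    size_t top = 0;
    size_t lo = 0, hi = n;

    for (;;) {
        while (hi - lo > 1) {
            /* Lomuto partition with the middle element as pivot. */
            size_t mid = lo + (hi - lo) / 2;
            size_t store = lo;
            size_t i;
            int pivot;

            swap_int(&v[mid], &v[hi - 1]);
            pivot = v[hi - 1];
            for (i = lo; i + 1 < hi; i++) {
                if (v[i] < pivot) {
                    swap_int(&v[i], &v[store]);
                    store++;
                }
            }
            swap_int(&v[store], &v[hi - 1]);

            /* Defer the larger side, keep looping on the smaller one. */
            assert(top < SORT_MAX_PENDING);
            if (store - lo < hi - (store + 1)) {
                pending[top].lo = store + 1;
                pending[top].hi = hi;
                hi = store;
            } else {
                pending[top].lo = lo;
                pending[top].hi = store;
                lo = store + 1;
            }
            top++;
        }
        if (top == 0)
            break;
        top--;
        lo = pending[top].lo;
        hi = pending[top].hi;
    }
}
```

On 32-bit x86 the pending array is 32 entries of 8 bytes, so 256 bytes of stack, fixed and known in advance. Note that this only bounds stack use; the worst-case run time is still O(n^2).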

> > (…)
> Depends on where you are in the call stack and how much stack you
> allocate in each frame. Remember, the kernel has a fixed 12KB stack. I
> guess you could do something goofy with IoGetRemainingStackSize(), but
> it’d be better to just use a flat sort IMO.

Actually I thought it was 8K (2 pages), but if you say 12K that leaves me
plenty of room for nasty stuff :slight_smile:

> > (…)
> Funny you should ask. :slight_smile: see above, and also look at
> IoGetStackLimits().

Yeah, right, hit me :slight_smile:

Unfortunately the functions you mentioned only work for
IRQL < DISPATCH_LEVEL… (at least the DDK says so).
What does one do if one’s executing a DPC or ISR? I mean counting by
hand is really really cumbersome when there are a lot of callbacks
involved… Also I don’t really 100% trust the compiler (c++) to free
stack-space e.g. for nested local scopes in a function right where
the scope ends - especially not for “plain-old-data” variables…
Or is the only possible way saving the value of ESP when entering the
ISR and comparing it to the actual value in every called function?

I mean I’d like to write “flat” code (concerning call-depth), but the
usual “the day before yesterday” deadline really doesn’t help writing
“perfect” code… and most of the time it’s faster to just write some
piece of code and put a lot of ASSERT()s in than to “do the numbers”
on every possibility…

At least I don’t allocate buffers of unknown/non-const size on the
stack :slight_smile:

Regards,

Paul Groke

On Mon, 2004-08-09 at 18:37, xxxxx@tab.at wrote:

> > That was me (Steve), actually; don’t blame Don for my rants. :slight_smile:

> well, er, I’m sorry for that mistake - hope it doesn’t happen again!
> (Messed up 2 messages I was answering to)

No worries here; I’m quite happy to be confused with Don. :slight_smile: The
reverse is most probably not the case though!

> > > (…)
> > It’s still better to flatten your sort out into a for loop or
> > something. You leave yourself open to security problems if someone
> > feeds you a big buffer to sort - it’s a DoS at least.

> I meant this for really known sizes, like when the driver is queuing stuff
> in a self-allocated list/array (that has a known & fixed size).
> For unknown sizes one can always check the size and decide what
> sort-function to use.

True enough. I’m still sticking to my advice in general, though, as
many implementers don’t think through this issue as carefully as you
have.

> > > (…)
> > Depends on where you are in the call stack and how much stack you
> > allocate in each frame. Remember, the kernel has a fixed 12KB stack. I
> > guess you could do something goofy with IoGetRemainingStackSize(), but
> > it’d be better to just use a flat sort IMO.

> Actually I thought it was 8K (2 pages), but if you say 12K that leaves me
> plenty of room for nasty stuff :slight_smile:

I think it was 8K on NT4 and earlier and 12K after - I suppose the Microsoft
folks upped the stack by a page in order to support the deeper layering
of drivers in WDM. Also, this might only be x86 (Alpha had 8K
pages…). Someone correct me if I’m wrong?
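
To put rough numbers on "how much stack you allocate in each frame", here is a small sketch of the arithmetic. The helper name and every figure in the usage note (bytes already used, reserve, frame size) are assumptions picked for illustration, not measurements.

```c
#include <stddef.h>

/* Hypothetical helper: how many more frames of frame_bytes each fit into
   what is left of the stack while keeping reserve_bytes free for
   interrupts and for whatever the recursion itself calls.
   Returns 0 when nothing fits. */
size_t max_recursion_depth(size_t remaining_bytes,
                           size_t reserve_bytes,
                           size_t frame_bytes)
{
    if (frame_bytes == 0 || remaining_bytes <= reserve_bytes)
        return 0;
    return (remaining_bytes - reserve_bytes) / frame_bytes;
}
```

With a 12K stack of which 4K is already used by the callers above you, a 2K reserve and 48-byte frames, max_recursion_depth(12288 - 4096, 2048, 48) gives 128. A naive quicksort that hits its worst case recurses once per element, so under these assumed numbers it would run out of stack at not much more than a hundred elements.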

> > > (…)
> > Funny you should ask. :slight_smile: see above, and also look at
> > IoGetStackLimits().

> Yeah, right, hit me :slight_smile:

> Unfortunately the functions you mentioned only work for
> IRQL < DISPATCH_LEVEL… (at least the DDK says so).
> What does one do if one’s executing a DPC or ISR? I mean counting by
> hand is really really cumbersome when there are a lot of callbacks
> involved…

Maybe post it off to a worker thread? If it’s true in general that you
shouldn’t stay at DIRQL any longer than necessary, it’s also true for
DPC level. You don’t want to hog a CPU forever while you sort a big
list (quicksort is O(n^2) worst-case IIRC), and besides, that way you
get a fresh stack to work with.

One thing I don’t understand: why are the stack-measuring functions
IRQL < DISPATCH_LEVEL?

-sd

Hi again Steve,

> Maybe post it off to a worker thread? If it’s true in general that you
> shouldn’t stay at DIRQL any longer than necessary, it’s also true for
> DPC level. You don’t want to hog a CPU forever while you sort a big
> list (quicksort is O(n^2) worst-case IIRC), and besides, that way you
> get a fresh stack to work with.

well, again that’s something where I really have no choice. Actually I’m
not doing any sorts in the driver at all - I just wanted to discuss the
topic of recursive sorts out of personal interest because I saw no reason
to completely ban recursive algorithms in kernel-mode code.

In general I agree - one should do as much as possible at as low an IRQL
as possible.

What I do at DIRQL is servicing some devices like:
* UART with 8 byte FIFO (much too small to be serviced from DPC-level)
* parallel-driven banknote-acceptors/coin-acceptors
* parallel-driven hoppers
* debounced inputs

all connected to a digital-IO board. For the banknote-acceptor e.g. it’s
critical to have low response times in some situations, and for the hopper
even more, because if the latency when switching the motor-control-line
gets too high additional coins get paid out… not good.
To complete the fun the driver has to support different kinds of devices
and different mappings to the io-lines as well as 3 different board-types,
all configured via the registry at driver-load-time; much of the code
has to use runtime-binding in a lot of places, at least I don’t see
any other (better) way.
So I have to do much nasty stuff in the ISR that uses plenty of callbacks
(actually virtual function-calls).
All in all the ISR executes in about 25-35usec with a typical setup
(like 1 coin-acceptor, 1 note-acceptor, ~20 software-debounced
digital inputs and 1 UART) - so I think concerning execution-time it’s
not a big problem.
Also I think that saving one or two port IOs from or to the PCI-card
buys me a lot of CPU-instructions, so I focused on that rather than on
cutting down CPU-instructions.
I forgot to mention: all that runs in a 4ms periodic h/w-interrupt.

> One thing I don’t understand: why are the stack-measuring functions
> IRQL < DISPATCH_LEVEL?

Well, I was very surprised myself but the DDK clearly states < DPC-level,
not even <= - very strange…

I think I might try the save/compare ESP thingy - maybe it gives good
results for the ISR - and at least I’d know how much stack I really use
:slight_smile:
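
A minimal sketch of that save/compare idea, for the record. Reading ESP itself would need inline assembly, so this version approximates the stack pointer by the address of a local variable; all names and the 2048-byte budget are hypothetical. It assumes the stack grows downward (true on x86) and that only one instance of the ISR runs at a time, since the bookkeeping lives in plain globals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_BUDGET 2048u        /* assumed budget in bytes, pick your own */

static uintptr_t g_entry_sp;      /* stack position recorded at ISR entry */
static size_t    g_max_used;      /* deepest usage seen since that entry  */

/* Call once at the top of the ISR. */
void stack_mark_entry(uintptr_t sp)
{
    g_entry_sp = sp;
    g_max_used = 0;
}

/* Call from any function reached from the ISR. */
void stack_check(uintptr_t sp)
{
    /* Downward-growing stack: deeper frames have lower addresses. */
    size_t used = (sp < g_entry_sp) ? (size_t)(g_entry_sp - sp) : 0;

    if (used > g_max_used)
        g_max_used = used;
    assert(used <= STACK_BUDGET); /* stands in for a checked-build ASSERT() */
}

/* Deepest usage recorded since the last stack_mark_entry(). */
size_t stack_max_used(void)
{
    return g_max_used;
}

/* Convenience macros: the address of a fresh local approximates ESP. */
#define STACK_MARK_ENTRY() \
    do { volatile char probe_; stack_mark_entry((uintptr_t)&probe_); } while (0)
#define STACK_CHECK() \
    do { volatile char probe_; stack_check((uintptr_t)&probe_); } while (0)
```

The idea would be to put STACK_MARK_ENTRY() at the top of the ISR and STACK_CHECK() at the top of each callback; stack_max_used() then gives an approximation of the deepest point reached, good enough for an assertion but not an exact byte count.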

Regards,

Paul Groke

xxxxx@tab.at wrote:

> all connected to a digital-IO board. For the banknote-acceptor e.g. it's
> critical to have low response times in some situations, and for the hopper
> even more, because if the latency when switching the motor-control-line
> gets too high additional coins get paid out... not good.

Bad design. Repeat after me: Windows is not a Real Time Operating System!

You cannot be guaranteed *any* particular latency on interrupts or DPCs
in Windows. The system can go off on its own and do quite a variety of
tasks for extended periods of time.

Design your hardware so that *it* does all the critical timing-related
tasks.

To be fair, it will be *very rare* that a well controlled system is off
in la la land for longer than 100ms or so, but you can't guarantee it
without some pretty serious real-time extensions (if then :-).

../ray..

Please remove ".spamblock" from my email address if you need to contact
me outside the newsgroup.

On Tue, 2004-08-10 at 12:07, xxxxx@tab.at wrote:

> What I do at DIRQL is servicing some devices like:
> all connected to a digital-IO board. For the banknote-acceptor e.g. it’s
> critical to have low response times in some situations, and for the hopper
> even more, because if the latency when switching the motor-control-line
> gets too high additional coins get paid out… not good.

I know Ray said this already, but this is not what NT was designed to
do. You need an RTOS.

> I think I might try the save/compare ESP thingy - maybe it gives good
> results for the ISR - and at least I’d know how much stack I really use

Yep. Just remember, it’s not the least bit portable.

Interesting discussion. :slight_smile:

-Steve

Hi Ray & Steve!

xxxxx@synaptics.spamblock.com wrote:

> Bad design. Repeat after me: Windows is not a Real Time Operating
> System!

for( int i = 0; i < 1000; i++ )
{
    printf( "Windows is not a Real Time Operating System!\n" );
    Beep( 1000, 1000 );
}

> You cannot be guaranteed *any* particular latency on interrupts or DPCs
> in Windows. The system can go off on its own and do quite a variety of
> tasks for extended periods of time.

I’ve already noticed that.

> Design your hardware so that *it* does all the critical timing-related
> tasks.

Well, you know, it’s a running system. The driver works, even with Windows
not being an RTOS. I’m not 100% happy with the board, but as I just said, it
works pretty well, and then I’m just the guy who does the driver, right?

I just wanted to make clear why certain complicated code can’t be moved from
the ISR down to some DPC or even a driver-thread. And I’m sure you agree that
it’s better to have an “unknown ISR latency” than to have an “unknown DPC or
driver-thread latency” in this case, considering that I can’t change the h/w
right now.


xxxxx@positivenetworks.net wrote:

> > I think I might try the save/compare ESP thingy - maybe it gives good
> > results for the ISR - and at least I’d know how much stack I really use

> Yep. Just remember, it’s not the least bit portable.

> Interesting discussion. :slight_smile:

To be honest I don’t care if it’s portable. I’m just extending a 2000/XP
driver for x86 right now - nothing more. Also this will only be a
checked-build assertion, so I don’t see a problem there.
(I’m not talking about switching stacks or something like that)

Regards,

Paul Groke