RE: Documentation Bug: "Handling IRPs: What Every Driver Writer Needs to

Slava,

You are synchronizing back to your dispatch routine. As you can see, the
completion routine of the driver above you will always be called before the
IoCallDriver call that invoked your dispatch routine returns.

And if your own completion routine is called after your IoCallDriver call
returns, you can be sure that IoCallDriver returned STATUS_PENDING, so that
code has a severe performance penalty.

Thanks,
mK



My answer was not a disproof of your statement; I agreed with you when I
said “Yes”.
The code was only a demonstration that it is possible to return from
IoCallDriver in a synchronous request with an uncompleted Irp. Obviously,
my completion routine was called before returning from IoCallDriver.
The penalty for a synchronous request is low, because the object is already
in the signaled state when KeWaitFor… is called.


We are talking about different things; you are right, of course.

About the performance penalty, I disagree. You must set the event only when
the lower driver returned STATUS_PENDING; that is, you must check for
(Irp->PendingReturned == TRUE) in your completion routine and adapt your
dispatch routine accordingly. This removes the need to call KeSetEvent when
the request completes synchronously, and it improves performance because the
OS does not have to acquire the lock in all cases.
IMHO you must avoid calling KeSetEvent at IRQL==DISPATCH_LEVEL, as you would
do while holding a lock.
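
As a rough illustration of the check described above, here is a user-mode C
model, not driver code. Every type and function in it (the model_ names) is a
hypothetical stand-in invented for this sketch; it assumes a single thread
and a lower driver that completes the IRP immediately, and it only counts how
often the set-event and wait stand-ins run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-mode stand-ins. None of these are WDK types or calls. */
enum model_status { MODEL_SUCCESS, MODEL_PENDING };

struct model_event {
    bool signaled;
    unsigned set_calls;   /* how often the KeSetEvent stand-in ran */
    unsigned wait_calls;  /* how often the wait stand-in ran */
};

struct model_irp {
    bool pending_returned;           /* stands in for Irp->PendingReturned */
    enum model_status final_status;  /* stands in for Irp->IoStatus.Status */
};

/* Stand-in for KeSetEvent. */
static void model_set_event(struct model_event *ev)
{
    ev->signaled = true;
    ev->set_calls++;
}

/* Stand-in for the wait. A real wait would block; this single-threaded
 * model only checks that the event was signaled first. */
static void model_wait(struct model_event *ev)
{
    assert(ev->signaled);
    ev->wait_calls++;
}

/* Completion routine: signal only when the lower driver pended the IRP. */
static void model_completion(struct model_irp *irp, struct model_event *ev)
{
    if (irp->pending_returned)
        model_set_event(ev);
}

/* Stand-in for IoCallDriver. The lower driver either completes inline or
 * reports "pending"; the model then completes the IRP right away. */
static enum model_status model_call_lower(struct model_irp *irp,
                                          struct model_event *ev,
                                          bool lower_pends)
{
    irp->pending_returned = lower_pends;
    irp->final_status = MODEL_SUCCESS;
    model_completion(irp, ev);
    return lower_pends ? MODEL_PENDING : MODEL_SUCCESS;
}

/* Dispatch side: wait only if the call reported "pending". */
enum model_status model_forward_and_wait(struct model_irp *irp,
                                         struct model_event *ev,
                                         bool lower_pends)
{
    enum model_status status = model_call_lower(irp, ev, lower_pends);
    if (status == MODEL_PENDING) {
        model_wait(ev);
        status = irp->final_status;
    }
    return status;
}
```

In this model a request that completes inline touches neither stand-in, while
a pended request goes through each of them once.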

Thanks,
mK



Must is an awfully strong word.

For a driver which is going to handle requests once or twice a second
this optimization wouldn’t even be noticeable.

Personally I’d rather every driver be as simple as possible than every
driver be as fast as possible. Simple causes fewer system crashes :)

-p

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Misha Karpin
Sent: Wednesday, March 08, 2006 8:44 AM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Documentation Bug: "Handling IRPs: What Every
Driver Writer Needs to



Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

I do not understand what you meant by saying: “you must avoid calling
KeSetEvent at IRQL==DISPATCH_LEVEL, as you would do while holding a lock.”

Why do you object to using KeSetEvent at the DISPATCH_LEVEL?
Do you know another method for synchronizing the completion routine with
the thread that calls IoCallDriver and wants to wait for Irp completion
at IRQL <= APC_LEVEL?
What method does not use a synchronization object with a
DISPATCHER_HEADER and still allows the waiting thread to be rescheduled?


Peter, I agree with you, especially because the time spent processing a single
page fault may exceed the entire performance gain.


Peter Wieland wrote:

Personally I’d rather every driver be as simple as possible than every
driver be as fast as possible. Simple causes fewer system crashes :)

Amen to that, or “word up”, as I hear the kids say. Most
micro-optimizations, and many macro-optimizations, are misplaced. With
some exceptions, CPUs are now infinitely fast and memory is infinitely
large. Build the product by the rules, and then use some tangible
metric to decide whether it is “fast enough”. Only when it fails that
test should one open a bag of tricks.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> Why do you object to using KeSetEvent at the DISPATCH_LEVEL?

Yes, I cannot understand this either. For me, KeSetEvent at DISPATCH_LEVEL is
fine. Waits are not fine, but setting the event is fine.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Slava Imameyev wrote:

I do not understand what you meant by saying: “you must avoid calling
KeSetEvent at IRQL==DISPATCH_LEVEL, as you would do while holding a lock.”

Why do you object to using KeSetEvent at the DISPATCH_LEVEL? Do you know
another method for synchronizing the completion routine with the thread
that calls IoCallDriver and wants to wait for Irp completion at IRQL
<= APC_LEVEL? What method does not use a synchronization object with a
DISPATCHER_HEADER and still allows the waiting thread to be rescheduled?

Response:

“Managing Hardware Priorities” on the WDK: “Driver code that runs at IRQL >
PASSIVE_LEVEL should execute as quickly as possible. The higher the IRQL at
which a routine runs, the more important it is for good overall performance
to tune that routine to execute as quickly as possible.”

KeSetEvent help on WDK: Calling KeSetEvent causes the event to attain a
signaled state. If the event is a notification event, the system attempts
to satisfy as many waits as possible on the event object before clearing
the event. If the event is a synchronization event, one wait is satisfied
before the event is cleared.

That is, the KeSetEvent routine will acquire the dispatcher lock, which
could already be held by another processor, therefore hurting performance.
Performance optimization is about avoiding resource contention. The
KeSetEvent routine will then satisfy a waiting thread, readying it for
execution. I think all this is a lot of needless work for a synchronous IRP.

Of course, the use of KeSetEvent is needed for the asynchronous case.

Peter Wieland wrote:

Must is an awfully strong word.

For a driver which is going to handle requests once or twice a second this
optimization wouldn’t even be noticeable.

Personally I’d rather every driver be as simple as possible than every
driver be as fast as possible. Simple causes fewer system crashes :)

Response:

Sorry for my English. I agree with you, but I think that as simple as
possible does not mean as simple as not causing crashes. Also, I think a
driver is a lot simpler if it only executes necessary instructions. One of
the most common performance pitfalls is doing needless work.

Slava Imameyev wrote:

I agree with you, especially because the time spent processing a single
page fault may exceed the entire performance gain.

Response:

It depends on the frequency of the page faults and on the routine being called.

Tim Roberts wrote:

Amen to that, or “word up”, as I hear the kids say. Most
micro-optimizations, and many macro-optimizations, are misplaced. With some
exceptions, CPUs are now infinitely fast and memory is infinitely large.
Build the product by the rules, and then use some tangible metric to decide
whether it is “fast enough”. Only when it fails that test should one open a
bag of tricks.

Response:

CPUs are not so infinitely fast, as we now have more and more processors in
each computer. This is actually the reason why resource contention is so
important. Let me mention Amdahl’s law:

P(n, h) = 1 / (1 + (n - 1) * h), where n = number of processors and h =
fraction of time the lock is held.

Now suppose a 32-processor computer, with the dispatcher lock held 1% of
the time. That gives P = 1 / 1.31 ≈ 0.763, or about 76%. That is, about 24%
of the time each processor will be spinning on the lock. Then, I have a good
question: how are you going to use any tangible metric under so many
different loads and scenarios? IMHO performance engineering is not
something that you can add at the end of development, because optimization
is not doing tricky arithmetic, tuning loops or using assembly language.
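
The arithmetic in the 32-processor example can be checked with a few lines of
C. This is only a sketch of the formula as quoted in this message; the
function names lock_efficiency and lock_overhead are made up for the
illustration, and h is taken as a fraction (0.01 for 1%).

```c
#include <assert.h>

/* Efficiency model quoted in the message:
 * P(n, h) = 1 / (1 + (n - 1) * h)
 * n = number of processors, h = fraction of time the lock is held. */
double lock_efficiency(unsigned n, double h)
{
    assert(n >= 1);
    assert(h >= 0.0 && h <= 1.0);
    return 1.0 / (1.0 + (double)(n - 1) * h);
}

/* Fraction of time lost to contention under the same model. */
double lock_overhead(unsigned n, double h)
{
    return 1.0 - lock_efficiency(n, h);
}
```

For n = 32 and h = 0.01 the denominator is 1 + 31 * 0.01 = 1.31, so
lock_efficiency gives roughly 0.763 and lock_overhead roughly 0.237.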

Thanks,
mK



You are correct that performance engineering is not something you can do at the end of development.

Performance engineering also is not applying every possible optimization to your code. This is a waste of development time and increases risk by adding additional branches, additional code paths that need to be tested, and additional edge cases that you’ll never be able to exercise properly.

Performance engineering involves understanding the system at the larger scale, knowing where the bottlenecks are and measuring to determine where you are spending large amounts of time in critical paths. You can easily optimize the code and end up with no benefit if your design isn’t prepared to run any faster.

So you should ask what percentage of the entire operation is consumed in calling KeSetEvent. If it is noticeable (which would mean the entire operation was short) then I would first ask “why are you doing this operation synchronously in the first place” before I would suggest such a small change as a “performance improvement”. For a short, critical path operation I would move the post processing into the completion routine, spending a little more time at dispatch level in trade for removing the overhead of a potential context switch for each I/O operation.

-p

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Misha Karpin
Sent: Wednesday, March 08, 2006 4:03 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Documentation Bug: "Handling IRPs: What Every Driver Writer Needs to


I agree with you in the fundamental concepts.

Anyway, this is a forum with a lot of newbies looking for support, so I think
it was appropriate to note that this code is not optimal for synchronizing
back to dispatch.

Now we can discuss whether this is a performance optimization or not, and
maybe our conclusions would be the same. IMHO that code has a defective
simplification.

In the same Microsoft paper about IRPs
(http://www.microsoft.com/whdc/driver/kernel/IRPs.mspx), in the
“Synchronous I/O Responses” section:

"One way to implement a synchronous I/O design is shown in the following
code fragment:
// Register something that will set an event.
// (Not shown)

// Send the IRP down the device stack
IoCallDriver(nextDevice, Irp);

// Wait on an event to be signaled
KeWaitForSingleObject( &event, … );

// Get the final status
status = Irp->IoStatus.Status;

However, this design has a serious problem: the KeWaitForSingleObject
routine uses the system-wide dispatcher lock. This lock protects the signal
state of events, semaphores, and mutexes, and consequently is used
frequently throughout the operating system. Requiring the use of this lock
for every synchronous I/O operation would unacceptably hinder performance.
To avoid this problem, the IoCallDriver routine was designed to return a
status value."

KeSetEvent acquires the same lock, so the example has the same
performance problem twice.

Thanks,
mK


> CPUs are not so infinitely fast, as now we have more and more processors in

If you’re running the CPU-intensive stuff like compression or crypto of bulk
amounts of data, or the multimedia codecs - then yes.

But surely the CPU is infinitely fast if we’re speaking about KeSetEvent
overhead.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> However, this design has a serious problem: the KeWaitForSingleObject

> routine uses the system-wide dispatcher lock.

So what? You just cannot do sync IO without ever calling
KeWaitForSingleObject. There is just no way, so let’s forget about the perf
implications of the dispatcher.

Look at any DDK sample. All sync IO there is using KeWaitForSingleObject and
KeSetEvent.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Of course. The DDK samples synchronize IO in the optimal way. Take a look at
FileSpy!SpyPassThrough and FileSpy!SpyPassThroughCompletion.

Thanks,
mK
