Parallel process overlapped requests from a single thread in kernel mode?

Is there a way to concurrently process, inside the kernel, an arbitrary
number of overlapped IOs issued by a single thread? Multiple issuing
threads is easy, because you are in a different thread context for each
one.

Thanks,

Phil
Philip D. Barila
Seagate Technology LLC
(720) 684-1842

Well, handing the request off to a work item, DPC, or kernel thread after
pending it in the dispatch routine is the obvious way. I’ve used this for
inverted calls, where the initialization primes the pump with all the
requests, then the driver completes them as needed.
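As a minimal WDM sketch of that pend-and-queue pattern (the context structure, pool tag, and routine names here are illustrative, not from any particular driver; error handling is abbreviated):

```c
/* Sketch: pend the IRP in the dispatch routine, then hand the real
 * work to a system work item so the dispatch routine can return
 * STATUS_PENDING immediately. */

typedef struct _WORK_CONTEXT {
    PIO_WORKITEM WorkItem;
    PIRP         Irp;
} WORK_CONTEXT, *PWORK_CONTEXT;

VOID
ProcessRequestWorkItem(PDEVICE_OBJECT DeviceObject, PVOID Context)
{
    PWORK_CONTEXT ctx = (PWORK_CONTEXT)Context;

    /* ... do the actual processing of the request here ... */

    ctx->Irp->IoStatus.Status = STATUS_SUCCESS;
    ctx->Irp->IoStatus.Information = 0;
    IoCompleteRequest(ctx->Irp, IO_NO_INCREMENT);

    IoFreeWorkItem(ctx->WorkItem);
    ExFreePool(ctx);
}

NTSTATUS
DispatchDeviceControl(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PWORK_CONTEXT ctx = ExAllocatePoolWithTag(NonPagedPool,
                                              sizeof(WORK_CONTEXT), 'xtcW');
    if (ctx == NULL) {
        Irp->IoStatus.Status = STATUS_INSUFFICIENT_RESOURCES;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    ctx->WorkItem = IoAllocateWorkItem(DeviceObject);
    if (ctx->WorkItem == NULL) {
        ExFreePool(ctx);
        Irp->IoStatus.Status = STATUS_INSUFFICIENT_RESOURCES;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    ctx->Irp = Irp;

    /* Mark pending before queuing, so there is no race with the
     * work item completing the IRP first. */
    IoMarkIrpPending(Irp);
    IoQueueWorkItem(ctx->WorkItem, ProcessRequestWorkItem,
                    DelayedWorkQueue, ctx);
    return STATUS_PENDING;
}
```

Note that returning STATUS_PENDING doesn’t by itself guarantee the issuing thread gets control back before the work runs; the rest of the thread discusses exactly that wrinkle.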


Don Burn (MVP, Windows DDK)
Windows 2k/XP/2k3 Filesystem and Driver Consulting
http://www.windrvr.com
Remove StopSpam from the email to reply


Well, if you want to force it to happen you can always farm the work off
onto different worker threads. And you can’t ever be entirely sure that
the driver above you hasn’t done that (if you’re in, say, a storage
driver).

What are you really asking about?

-p

---
Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256
To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

Yes, a work item or DPC would be the obvious way. If there isn’t a DPC
already running, doesn’t queuing the DPC result in an immediate call into
the DPC callback, in the same thread context? (So far, that has been my
observation. But since I haven’t yet implemented the I/O completion ports
needed to finish the UM part, there isn’t any real overlapping going on,
so that observation may not hold once I get it done… I expected that
passing a valid LPOVERLAPPED with a valid event would allow the pending
part to work as expected. It works, just not as overlapped as I thought
it would.)
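A sketch of the user-mode issuing pattern being described (the device name, IOCTL code, and buffer sizes below are placeholders, not from the actual project; error handling is abbreviated):

```c
/* Sketch: issue several overlapped I/Os back-to-back from one thread,
 * then wait for all of them. If the driver really pends each request,
 * every DeviceIoControl call should return immediately with
 * ERROR_IO_PENDING. */
#include <windows.h>
#include <stdio.h>

/* Placeholder IOCTL code -- substitute the real one. */
#define MY_IOCTL_CODE CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, \
                               METHOD_BUFFERED, FILE_ANY_ACCESS)
#define NUM_IO 4

int main(void)
{
    HANDLE h = CreateFile("\\\\.\\MyDevice",           /* placeholder name */
                          GENERIC_READ | GENERIC_WRITE, 0, NULL,
                          OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    OVERLAPPED ov[NUM_IO] = {0};
    HANDLE events[NUM_IO];
    char buf[NUM_IO][512];
    DWORD bytes;

    for (int i = 0; i < NUM_IO; i++) {
        events[i] = CreateEvent(NULL, TRUE, FALSE, NULL);
        ov[i].hEvent = events[i];
        if (!DeviceIoControl(h, MY_IOCTL_CODE, NULL, 0,
                             buf[i], sizeof buf[i], &bytes, &ov[i]) &&
            GetLastError() != ERROR_IO_PENDING) {
            fprintf(stderr, "request %d failed\n", i);
        }
    }

    /* All NUM_IO requests are now (ideally) in flight concurrently. */
    WaitForMultipleObjects(NUM_IO, events, TRUE, INFINITE);

    for (int i = 0; i < NUM_IO; i++)
        CloseHandle(events[i]);
    CloseHandle(h);
    return 0;
}
```

The event-array wait is shown for simplicity; the completion-port version replaces it with GetQueuedCompletionStatus.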

Pending is nice, except that you don’t get to the “pend” part until all
the OS calls you have to make (including the calls to queue the DPC/work
item) return, and if those OS calls result in an inline callback, you
might as well not try to spool the work off, since you aren’t actually
going to “pend” until you’ve done all the work.

Maybe a kernel thread would do it…

In answer to Peter Wieland, what I’m really trying to do is just what I
said. I am (going to be, see above) issuing multiple overlapped IOs from a
single thread, and my observation is that I don’t return to UM to start
the next one until I’ve completely preprocessed the current one. I believe
that is because, in the absence of another instance of my DPC already
running, my DPC is called inline when I queue it.

If it turns out that the only way to really make that happen is to force
it into a kernel thread, perhaps I’ll revisit the requirement and just
simplify the code a lot. Queuing a DPC that results in an inline callback
is a pretty expensive way to make a function call, no?

Thanks,

Phil

Philip D. Barila
Seagate Technology LLC
(720) 684-1842


You’re correct. A DPC doesn’t do anything to try to move work to
another thread/processor. It runs whenever it can, and if you queue a
normal-priority DPC from passive level then “whenever it can” is “right then”.
You might try using a low-priority DPC … but I think that will get
deferred, and I doubt that’s what you want either.

Another alternative would be to allocate a DPC per processor, affinitize
each one to its CPU, and then always try to pick a DPC targeted at a
different processor when you go to enqueue one. However, that will
inevitably interrupt some other thread running on that processor, which
may itself be trying to dispatch multiple I/Os concurrently from a single
thread.
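A sketch of that per-processor DPC arrangement (2007-era APIs; the fixed-size array, caller-supplied CPU count, and round-robin target choice are simplifications):

```c
/* Sketch: one DPC per processor, each targeted at its own CPU via
 * KeSetTargetProcessorDpc(). At enqueue time, pick a DPC targeted at
 * a processor other than the current one so the callback runs
 * elsewhere. Illustrative only. */

#define MAX_CPUS 32

KDPC  g_Dpc[MAX_CPUS];
CCHAR g_CpuCount;

VOID MyDpcRoutine(PKDPC Dpc, PVOID Context,
                  PVOID SystemArgument1, PVOID SystemArgument2);

VOID
InitPerCpuDpcs(PVOID Context, CCHAR CpuCount)
{
    CCHAR i;
    g_CpuCount = CpuCount;
    for (i = 0; i < g_CpuCount; i++) {
        KeInitializeDpc(&g_Dpc[i], MyDpcRoutine, Context);
        KeSetTargetProcessorDpc(&g_Dpc[i], i);
    }
}

VOID
QueueOnOtherCpu(PVOID Arg1, PVOID Arg2)
{
    /* Round-robin to the next CPU, skipping the current one. */
    CCHAR current = (CCHAR)KeGetCurrentProcessorNumber();
    CCHAR target  = (CCHAR)((current + 1) % g_CpuCount);
    KeInsertQueueDpc(&g_Dpc[target], Arg1, Arg2);
}
```

Targeting current + 1 keeps the callback off the issuing CPU, which is the point; but, as noted, whatever thread is running on the target CPU gets interrupted instead.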

In the end either you dispatch the I/O on the thread it comes in on, or
you take the context switch hit of moving that work to another thread.

-p


Yes, this is a common wrinkle in trying to be a good, asynchronous citizen. Your analysis of what happens with the DPC is correct, and you’ll have the same issue with the O/S work item package but for a different reason:

By default, threads run with a priority of 7-9ish, but the O/S worker queue threads run at priorities that are higher in the dynamic range. IIRC it’s currently:

Delayed queue - priority 12 (i.e. a NotSoDelayed queue)
Normal queue - priority 13
Hypercritical queue - priority 15

So, you queue your work item from the priority 8 thread, the event the priority 13 thread is waiting on becomes signaled, and 13 is higher than 8 so the current thread is yanked off the CPU, the worker thread is put on, and your work item runs immediately. Same result as you’re seeing now, except this time it’s caused by thread priority and not IRQL.

Typically the suggestion for fixing this is to create your own work queue package so that you have control over the priority of the workers. It’s fairly minimal effort and usually turns out to be the sort of thing that’s generally useful anyway.

-scott


Scott Noone
Software Engineer
OSR Open Systems Resources, Inc.
http://www.osronline.com


Scott Noone wrote:

So, you queue your work item from the priority 8 thread, the event the
priority 13 thread is waiting on becomes signaled, and 13 is higher
than 8 so the current thread is yanked off the CPU, the worker thread
is put on, and your work item runs immediately. Same result as you’re
seeing now, except this time it’s caused by thread priority and not IRQL.

And, with any luck, that thread will immediately wake up, block on some
mutex that you already have locked in the lower-priority thread (or,
even worse, a critical section with a spin count), and re-schedule the
original thread (or some other, equal-priority thread). This leads to
bad thread thrashing, and is quite a cache and scheduling nuisance –
even worse (because of the spin count) on multi-CPU systems.

For my own education: how does the Windows kernel deal with this?

Back in the day, on another kernel, what we did was provide a flag on
the signalling primitives that advised the scheduler not to re-evaluate
priorities yet. If properly used, the lower-priority thread would
run until it reached a true blocking point (which was very common), or
until the pre-emption quantum was reached. Of course, the quantum we used
was 3 milliseconds, so latencies were pretty good anyway.

Cheers,

/ h+

> Yes, a work item or DPC would be the obvious way.

I think you must first determine the degree of parallelism you want, and
the possible parallelism limiters.

If the parallelism limiter is CPU performance only, then create a worker
thread per CPU and offload all the work to them. You will not be able to
achieve better results than that.

But if the parallelism limiter is waiting for some events to complete
(the usual, typically I/O-bound, case in the OS kernel), then other
approaches are better. For instance, split each operation into
non-I/O-bound parts, divided by the waits, and run those parts in
parallel, so that part 1 of operation 2 runs on a CPU while operation 1
waits for its event.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com