Unexpected RESOURCE_NOT_OWNED on hibernation request

> Hi Joseph,

>Key here is that you have to try all the accepted “good”
>ways of asking a process to close; if they all fail, then you
>can consider TerminateProcess.

Never said anything different,…

>and putting a WM_QUIT message in the queue

Posting this to a foreign process is for sure deadly, but not if you are
prepared to handle it in your own app!
****
Actually, this is very hard to deal with. Suppose you need the message
pump during cleanup? The alternative is to carry around massive amounts
of state in global variables (or CWinApp-derived-class variables)
complicating the coding and maintenance. I’ve seen the results of this,
and they Aren’t Pretty.
*****

>Note that anything that causes a long delay in message processing

That's why I am using threads, synchronisation and a signaling mechanism to
keep false positives to a minimum.
****
Threads in your app? And what is a “signaling mechanism”? The choice of
implementation makes a huge difference in the robustness of the app suite.
*****

>Registered Window Message requesting emergency shutdown, which
>is semi-isomorphic to WM_CLOSE, except it knows there is no user to
>respond to confirmation

If your application has its own mechanism to handle this, that's fine, but
the application must still be responding, or no message of any kind will
arrive once its message pump, threads, or command-polling queue is frozen!
Here Application Recovery and Restart (ARR) comes in quite handy, or another
watchdog can help, e.g. a service, but there you run into the session
barriers, etc.,…
****
If the message pump freezes, it means the app is defective in design.
This is something you have to make sure cannot happen. Note that, as you
point out, you cannot use a service to do this. Therefore, there is
nothing to prevent the user from killing your watchdog app. Which is why
my system had mutual auto-restart.

Note that some of the mechanisms which purport to save a program’s state
during a restart either cannot do so, or cannot restore the state they
think they saved, or in fact are saving the very state that caused the
lockup in the first place, none of which are particularly good.
****

>suites of programs that are robust and can make guarantees about
> correctness.

There simply is no such thing :frowning:
****
Actually, unless someone uses task manager to kill one of my apps, we
guarantee correctness. I spent weeks making sure that worked. Once
TerminateProcess is in the picture, though, there can never be guarantees.
****

I think we can debate this “best practices” and “design concepts” topic
forever and still find no 100% failsafe solution. It is impossible to handle
every kind of failure in an application, because some things are outside the
scope of your app, e.g. the runtime, external circumstances, operating system
failure, power loss, and so on. IMHO there is NO way to make an application
work 100% safely; even with transactional processes, there is always
something that can fail. I am sure you know who Edward A. Murphy is
:slight_smile:
****
Been there, done that, could even recover from memory parity errors. One
app restarted itself about once a day when its heap got corrupted; we
never did find the cause of the heap corruption, but we could and did (I
did) recover from it. I will admit, I had not expected to get a memory
parity error reflected to the process by the kernel, but it did, and I
handled it correctly. The process promised “best attempt” delivery of
messages to its communicating processes, but if there was a failure, both
sender and receiver got a notification of the error. MTBF went from 45
minutes to six weeks (when a campus-wide power failure shut down
everything). I spent on the order of a year working on this project,
which was an OS-critical component which, when it failed, required a
complete reboot to fix. I could guarantee that either every packet got
through or there was a notification of failure.

Yes, if the power fails while the disk directory is being written, you
have potential problems, but a transacted file system (which we had)
catches those as well. So there is a way, and I’ve done it, and therefore
I don’t believe the assertion that it cannot be done.
****

I would like to ask something about DuplicateHandle and handle leaks: What if
I have a Process A and a Process B, and Process A accesses an object handle
in Process B with DuplicateHandle and DUPLICATE_SAME_ACCESS, but Process A
dies before Process B can close the handle to the Process A object? Am I
right in saying that the kernel object is not freed unless Process B releases
its handle to the object from the dead Process A? Isn't that a “sort” of
handle leak? As long as the reference count is >0 the object and its memory
stay alive, but the object does not belong to Process B. Or am I wrong, and
the handle to the object goes down with Process A, so the duplicate handle in
Process B is invalid?
****
Duplicate handles will leak only if the process into which they have been
duplicated keeps running without closing them. ALL handles are closed
when a process terminates. The logic does not care what kind of handle
it is, original or duplicate; it is all maintained by reference counts in
the kernel. In fact, it is frequently the case that once process A
creates a duplicate handle in B that A will terminate. Now there is only
one handle left, the handle in B. When B terminates, if the handle has
not been closed, it will be forced closed. It is not a “sort” of handle
leak; it is not a leak at all. The system is behaving correctly. If B is
using the handle, the handle is validly in use, and must NOT be forced
closed just because A terminated. And the handle DOES belong to process
B. The internal logic does not care in the slightest that process A
*opened* the handle; what the kernel sees is two references to an object,
then one reference to that object. There is no way for it to tell WHICH
reference (original or duplicate) is outstanding, and, frankly, nobody
cares. All that matters is that there is a reference.

If you have a long-term process B, and A creates a duplicate handle in B’s
handle space, and B never closes it, then B is defective. Fix the bug.
There is no concept of “belonging to” at this level, just raw “reference
count”. A handle belongs to the process that can address it. As long as
any one handle remains in use, the objects managed by that handle remain
valid. Closing the “original” handle has ZERO impact on the validity of
any and all duplicates which may have been made of it.
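
To make the reference counting concrete, here is a minimal user-mode sketch
of the scenario in your question (the target PID and the IPC used to hand the
duplicated handle value over to process B are assumptions for illustration;
error handling is abbreviated):

```c
/* Process A duplicates an event handle into process B.  After A exits, B's
   duplicate remains fully valid, because the kernel only tracks the total
   number of handles referencing the object, not which process created it. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD targetPid = 1234;   /* PID of process B -- assumed for illustration */
    HANDLE hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    HANDLE hTarget = OpenProcess(PROCESS_DUP_HANDLE, FALSE, targetPid);
    if (hTarget == NULL) {
        printf("OpenProcess failed: %lu\n", GetLastError());
        return 1;
    }

    HANDLE hInB = NULL;       /* this value is meaningful only inside B */
    if (!DuplicateHandle(GetCurrentProcess(), hEvent,
                         hTarget, &hInB,
                         0, FALSE, DUPLICATE_SAME_ACCESS)) {
        printf("DuplicateHandle failed: %lu\n", GetLastError());
        return 1;
    }

    /* Hand the numeric value of hInB to B over some IPC channel (not shown),
       then A can exit.  A's handles are closed at process rundown, but the
       event object stays alive as long as B's duplicate remains open. */
    CloseHandle(hTarget);
    CloseHandle(hEvent);
    return 0;
}
```

After DuplicateHandle succeeds there are two references to the event; when A
exits, A's references go away, and the one in B keeps the object alive until
B closes it or terminates.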

Consider the classic example: using an unnamed pipe to deal with a child
process writing to stdout/stderr. The protocol goes like this:

* Create an anonymous pipe, getting two handles, one for its input side
and one for its output side
* Create an inheritable duplicate of the input handle
* Place the handle value in the process creation block for stdout
* Create an inheritable duplicate of the input handle (a second one!)
* Place the second handle value in the process creation block for stderr
* CreateProcess, specifying that inheritable handles be inherited.
* Close the non-inheritable input handle
* Close the inheritable input handle
* Close the second inheritable input handle
* ReadFile from the output handle until you get ERROR_BROKEN_PIPE,
indicating the child process has closed the handle

Note that the non-inheritable and inheritable input handles must all be
closed so that the ONLY remaining valid handles are in the child process.
When the child process terminates, the result will be that those handles
are forced closed and the ERROR_BROKEN_PIPE will result. Failure to close
these handles (the most common error my students make, even though I give
them code examples on the slides) means the ReadFile will hang, because there
are still valid handles for the input side of the pipe even though the child
process has exited.
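
Here is what the protocol above looks like as a sketch in Win32 C
(“child.exe” is a placeholder, and most error checking is omitted for
brevity):

```c
/* Anonymous pipe + two inheritable duplicates of the write ("input") side for
   the child's stdout and stderr; close the parent's copies, then read until
   the pipe breaks. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE hRead, hWrite, hOut, hErr;
    CreatePipe(&hRead, &hWrite, NULL, 0);      /* both handles non-inheritable */

    /* Two separate inheritable duplicates of the input (write) side: one for
       stdout, one for stderr, so the child closing one cannot kill the other. */
    DuplicateHandle(GetCurrentProcess(), hWrite, GetCurrentProcess(), &hOut,
                    0, TRUE, DUPLICATE_SAME_ACCESS);
    DuplicateHandle(GetCurrentProcess(), hWrite, GetCurrentProcess(), &hErr,
                    0, TRUE, DUPLICATE_SAME_ACCESS);

    STARTUPINFOA si = { 0 };
    si.cb = sizeof(si);
    si.dwFlags    = STARTF_USESTDHANDLES;      /* the "process creation block" */
    si.hStdOutput = hOut;
    si.hStdError  = hErr;
    si.hStdInput  = GetStdHandle(STD_INPUT_HANDLE);

    PROCESS_INFORMATION pi;
    char cmd[] = "child.exe";
    CreateProcessA(NULL, cmd, NULL, NULL, TRUE /* inherit handles */,
                   0, NULL, NULL, &si, &pi);

    /* Close every parent copy of the write side, or ReadFile never breaks. */
    CloseHandle(hWrite);
    CloseHandle(hOut);
    CloseHandle(hErr);

    char buf[512];
    DWORD got;
    while (ReadFile(hRead, buf, sizeof(buf), &got, NULL) && got != 0)
        fwrite(buf, 1, got, stdout);
    /* ReadFile fails with ERROR_BROKEN_PIPE once the child's handles close. */

    CloseHandle(hRead);
    CloseHandle(pi.hProcess);
    CloseHandle(pi.hThread);
    return 0;
}
```

Note the three CloseHandle calls on the write side before the read loop;
leaving any one of them open is exactly the hang described above.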

The reason for two duplicates is this: some apps do not use stderr, so the
first thing they do is close the stderr handle. Suppose the numeric value
of that handle was 12345 (just for discussion). If I put the same handle
value in for stdout and stderr, then when the child process closes stderr,
it closes handle 12345. The kernel does not care that this is also the
value for the stdout handle; it closes the handle. The caller sees broken
pipe; the child process will get an error writing to stdout. So you need
two duplicates. Upon completion of the second DuplicateHandle, there are
three handles to the input side: the original non-inheritable handle and
two inheritable handles. After CreateProcess, there are five handles to
the input side: the three aforementioned parent handles, and the two
handles for stdout and stderr in the child process. Now, let’s say the
child process closes stderr. Now there are four handles. The parent
closes the non-inheritable input handle and the two inheritable input
handles, leaving one valid handle, the child process’s stdout. When that
handle is closed, either by the child process issuing a CloseHandle or the
process just terminating (say, for an error), then there are zero handles
left, and the ReadFile gets the ERROR_BROKEN_PIPE condition. Note that
the parent process no longer possesses ANY handles to the input side of
the pipe, yet the pipe remains perfectly valid, because there is one valid
handle.

I find a lot of confusion about “ownership” issues. Processes do not own
other processes (unless you create a process group, which is a rare
occurrence), and the termination of one process has no consequences in
terms of the kernel forcing termination of any processes it created. The
concepts of “parent” and “child” exist for conversational purposes, but the
kernel does not care in the slightest. Similarly, processes own threads;
threads do not own threads, and termination of one thread has zero kernel
impact on the status of any threads it created (your program may hang
because it is waiting for something from that recently-deceased thread,
but that represents a bug in your program, not the kernel). Processes own
handles. They do not care if these are “original” handles from
CreateFile, or “duplicate” handles from DuplicateHandle. A handle is a
handle. FAILING to close a handle can have an impact on correctness of a
program (the most common error I see is a student says “It hangs on the
ReadFile call” which is true because there is nothing to read, but the
presence of valid handles to the input side means that there WILL be input
someday, except that the parent process is not going to write to those
handles, but the kernel does not know or care. The code is simply
incorrect as written).

So don’t think that a duplicate handle is a second-class citizen, whose
existence depends on the continuing existence of the original handle. It
is a full-fledged handle, with all the rights thereunto, one of which is
to keep the file or device open until some data is read or written via
that handle. Only when ALL outstanding handles are closed will you get
IRP_MJ_CLOSE in your driver.
joe
****



MTBF of six weeks is not something to be especially proud of, no matter how
cleaver your recovery algorithm was. As we are all fond of pointing out, your
time might have been better spent finding and fixing the real bug. But
facetiousness aside, I wanted to point out that your assertion that a
software-only solution can guarantee either an outgoing message or an error
is patently false. There are some classes of hardware failure that are either
impossible to detect or impossible to recover from (or both) in software; the
obvious example that comes to mind is incorrect instruction execution by a
CPU. Bus failures and memory failures are also in the category of conditions
not handled by normal hardware, or things that the ASIC / firmware must deal
with or (hopefully) detect. What software can do is use persistent storage to
recover to a consistent state. This state may not be the latest one that
existed before the failure, but an appropriately designed transactional
system using stable media can recover from a lot. Use of transmission
protocols that include sequence or state can make failures virtually
transparent, but it takes consistent application of sound engineering
principles at all levels of a system to bring the MTBF from hours to years.


On 3/18/2012 10:29 PM, xxxxx@flounder.com wrote:
[nice lesson in multi-thread pipe handling]
Thanks, Joe!

On 3/18/2012 10:29 PM, xxxxx@flounder.com wrote:

MTBF went from 45 minutes to six weeks (when a campus-wide power
failure shut down everything). […]

Yes, if the power fails while the disk directory is being written,
you have potential problems, but a transacted file system (which we
had) catches those as well.

On 3/19/2012 12:19 AM, m wrote:

MTBF of six weeks is not something to be especially proud of no
matter how cleaver your recovery algorithm was.

Not reading properly is not something to be especially proud of, no
matter how clever the reader was.

>[nice lesson in multi-thread pipe handling] Thanks, Joe!

Yes, that was really fine,…

In 1975, an MTBF of more than a day for anything other than a
multimillion-dollar mainframe (with resident onsite vendor support) was
essentially unheard of. We had a home-built multiprocessor system which
used PDP-11 minicomputers. Today, I expect my systems to run for six
months without a reboot, but 37 years ago this would have been a minor
miracle. Our IBM mainframe crashed at least once a day, because the
University could not afford the service contract with the permanent onsite
service engineer.

There was an interesting lesson here: it doesn’t matter if the software
works if nobody notices it fails. I spent several months trying to find
the bug in a mass of twisted code I had inherited from a graduate student.
At the same time, we were trying to increase the MTBF from 45 minutes,
which included serious hardware debugging (we discovered that if a
rotate-register instruction was active when a DMA request came in, the
contents of the rotated register were scrambled. When we reported this to
Digital Equipment Corporation, their first response was “that must be why
the diagnostics fail” and their second response was “we aren’t going to
tell any of our existing customers because they’d want us to fix it”).

We never did find the reason for the heap corruption, but we always
suspected a double-bit memory error on parity-checked memory. My job was
to make the software work reliably under a variety of failure modes. No
matter what went wrong, I had to recover from it. I/O data errors,
protocol errors, errors from API calls, sender process termination,
receiver process termination, packet checksum errors, this was the central
message switcher process for the whole operating system. The original
code was a listing about a half-inch thick, and assumed that every API
would succeed, and any failure was an unrecoverable fatal error. When I
was done, the listing was four inches thick and ran nonstop. Since we were
running in a 16-bit address space, heap fragmentation was a problem, so I
had a “heap compaction” algorithm that essentially re-created the heap
from scratch. So when I got the equivalent of 0xC0000005, I just
“compacted” the heap by calling the heap-compaction subroutine.

One of our most infamous problems was the failed-CALL instruction. It
pushed the return address onto the stack, then failed to do the IP
transfer so just executed the next instruction. I was impervious to these
because they invariably led to an instruction fault (usually the equivalent
of 0xC0000005) very shortly thereafter. If this ever happened, I’d
essentially restart the message multiplexor. Since this particular
failure was one-in-ten-to-the-moderately-large probability, we wouldn’t
hit it again for a while. If we saw too many hardware failures in a
processor, it was partitioned out of the pool, and set to running
diagnostics. If the diagnostics ran for N hours without error, the
processor was partitioned in as a “kernel-only” processor. It continued
to run diagnostics, but all the I/O devices on its bus were now available.
If the diagnostics ran for K more hours without a problem, it was
reinstated as a full processor and could run apps.

We had a similar treatment for memory. Pages with too many parity errors
were partitioned out on-the-fly but there was persistent partitioning
across reboots. By looking at the various logs, we could determine what
chip or set of chips was suspect, or even if it was in the bank selection
circuitry.

Many apps worked with strong state-machine protocols for robustness. And
one of our researchers came up with a non-locking *unsafe* algorithm for a
partial-differential-equation solver. The problem was partitioned into a
4x4 matrix. It was a convolution algorithm, which meant each iteration
computed a value from the adjacent points. So at the boundaries, you
would expect to lock the adjacent partition to compute values that crossed
the partition boundary. Instead, he ignored locking entirely, using
obsolete values for computations. His insight was that since the
algorithm is convergent, using a value some other processor is in the
midst of recomputing did not matter; it just converged a little more
slowly. He got a PhD for this work. But it was based on the same
principle I used: it doesn’t have to be *right* as long as it remains
*correct*.
joe


xxxxx@flounder.com wrote:

TerminateProcess is a very large and dangerous tool to invoke. It can
lead to all kinds of failures. Yes, you can open a jar of pickles with a
sledgehammer, but you will eat fewer glass slivers if you use a bottle
opener.

That’s a good metaphor. In fact, I am currently embroiled in a
“situation” exactly because of this.

One of my clients has someone using one of my camera drivers. If he
kills the application from Task Manager (which uses TerminateProcess),
the application doesn’t go away. He has to reboot. He blames my
driver. Well, it turns out the application does go away, 4 minutes
later. That tells me there must be an outstanding I/O request somewhere.

This is an AVStream driver. DirectShow opens one file handle for the
filter, and one for each pin. When there is only one pin, things work
fine. The streaming driver quickly consumes all of the outstanding
buffers, so there are no I/O requests. But in this case, the
application is using both the capture pin and the snapshot pin. The
snapshot pin has an outstanding buffer, waiting for a snapshot, which
will never be consumed. That’s an outstanding IRP, and it prevents the
shutdown.

Further, because of the effects of Murphy, the process tries to close
the snapshot file handle first, so there is no way for me to tell that I
am in this dire situation. My AVStream driver doesn’t get an
IRP_MJ_CLEANUP. All I see is that the capture pin runs dry of buffers.

Ugly. My current workaround is to watch for a long period with no fresh
buffers coming in, and when that happens, I release any outstanding
buffers on the other pin. That allows termination to proceed. I’m a
little concerned that this might trigger falsely in an extraordinarily
busy system, but the only side effect would be that the app sees a
series of zero-length frames on its snapshot pin, which should not be fatal.
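
For what it's worth, here is a rough sketch of that workaround using the
documented KS stream-pointer routines; it is not my actual driver code, and
the stall timeout and the way the last capture timestamp is tracked are made
up for illustration:

```c
/* If the capture pin has seen no fresh buffer for "too long", eject whatever
   frames are queued on the snapshot pin as zero-length frames so process
   rundown can complete. */
#include <ntddk.h>
#include <ks.h>

#define SNAPSHOT_STALL_100NS (10ull * 10000000ull)   /* ~10 seconds, arbitrary */

VOID ReleaseStalledSnapshotBuffers(PKSPIN SnapshotPin,
                                   ULONGLONG LastCaptureTime100ns)
{
    PKSSTREAM_POINTER sp;

    if (KeQueryInterruptTime() - LastCaptureTime100ns < SNAPSHOT_STALL_100NS) {
        return;                      /* capture pin is still producing frames */
    }

    /* Walk the snapshot pin's queue and complete every pending frame empty. */
    while ((sp = KsPinGetLeadingEdgeStreamPointer(
                     SnapshotPin, KSSTREAM_POINTER_STATE_LOCKED)) != NULL) {
        sp->StreamHeader->DataUsed = 0;              /* zero-length frame */
        KsStreamPointerAdvanceOffsetsAndUnlock(sp, 0, 0, TRUE /* eject */);
    }
}
```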


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Tim,

When a process (or a thread) gets killed or exits, all of its queued IRPs get an IoCancelIrp call first. It looks like your hanging IRP is not cancelable, which is why the process is not dying. Closing the handles is the next stage of process rundown.
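
For illustration, a minimal WDM-style sketch of what a cancelable queued IRP
looks like (DEVICE_EXTENSION, SnapshotQueue and QueueLock are invented names
for this sketch; a cancel-safe queue via IoCsqInitialize, or KMDF, does the
same bookkeeping for you):

```c
/* Queue an IRP so that IoCancelIrp (called during process/thread rundown)
   can actually remove it from the queue and complete it. */
#include <wdm.h>

typedef struct _DEVICE_EXTENSION {      /* invented for this sketch */
    KSPIN_LOCK QueueLock;
    LIST_ENTRY SnapshotQueue;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

DRIVER_CANCEL SnapshotCancelRoutine;

VOID SnapshotCancelRoutine(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PDEVICE_EXTENSION ext = (PDEVICE_EXTENSION)DeviceObject->DeviceExtension;
    KIRQL irql;

    IoReleaseCancelSpinLock(Irp->CancelIrql);   /* we use our own queue lock */

    KeAcquireSpinLock(&ext->QueueLock, &irql);
    RemoveEntryList(&Irp->Tail.Overlay.ListEntry);
    KeReleaseSpinLock(&ext->QueueLock, irql);

    Irp->IoStatus.Status = STATUS_CANCELLED;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}

NTSTATUS QueueSnapshotIrp(PDEVICE_EXTENSION ext, PIRP Irp)
{
    KIRQL irql;

    KeAcquireSpinLock(&ext->QueueLock, &irql);
    IoMarkIrpPending(Irp);
    InsertTailList(&ext->SnapshotQueue, &Irp->Tail.Overlay.ListEntry);
    IoSetCancelRoutine(Irp, SnapshotCancelRoutine);

    /* Close the race: if the IRP was cancelled before the routine was set,
       and the routine has not been claimed by IoCancelIrp, clean up here. */
    if (Irp->Cancel && IoSetCancelRoutine(Irp, NULL) != NULL) {
        RemoveEntryList(&Irp->Tail.Overlay.ListEntry);
        KeReleaseSpinLock(&ext->QueueLock, irql);
        Irp->IoStatus.Status = STATUS_CANCELLED;
        Irp->IoStatus.Information = 0;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return STATUS_PENDING;          /* we already marked the IRP pending */
    }

    KeReleaseSpinLock(&ext->QueueLock, irql);
    return STATUS_PENDING;
}
```

With a cancel routine registered, rundown's IoCancelIrp can pull the snapshot
IRP off the queue immediately instead of waiting for a buffer that will never
arrive.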

>When a process (or a thread) gets killed or exits, all queued IOs get
IoCancelIrp call first.

It’s also entirely possible that the thread exit path will wait for each IRP to complete before trying to cancel the next IRP in the list. And the first IRP would be the snapshot buffer.

xxxxx@broadcom.com wrote:

When a process (or a thread) gets killed or exits, all queued IOs get IoCancelIrp call first. It looks like your hanging IRP is not cancelable, this is why the process is not dying. Closing of the handles is next stage of process rundown.

You know, it’s funny that all it takes is a post from someone with a
different point of view to unlock our blockades. I knew these facts,
but I hadn’t thought of it in quite this way.

The sample AVStream drivers make “clones” of the stream buffers, as a
way to add an additional reference so they can build scatter/gather
lists. That does make the original IRP uncancelable. I copied that
architecture (as does virtually every capture driver writer), and it has
worked for 5 years, but it is that feature that causes this termination
deadlock. I don’t need the clone concept for a USB camera, so in the
last two hours, I just redid the capture code without the clones, and
the problem has been eliminated.

I owe you a beer.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

It’s called too many trees to see the forest. :slight_smile:

I think we have all done that.

Gary Little
H (952) 223-1349
C (952) 454-4629
xxxxx@comcast.net


It’s really interesting to see how this thread moved from an unexpected RESOURCE_NOT_OWNED through best practices in resource management and I/O handling to cloned stream buffers :slight_smile:

I hope that you meant not writing properly, as I clearly can’t spell and in
a news group have no compiler to help me with syntax errors.

The assessment of whether I read properly or not is more complex, and I leave
that to others to determine.


With all respect intended, and having worked with PDP-11s myself, I wasn’t
trying to cast aspersions on what was possible in 1975 or the relative
merits of the work done then. The current state of the art is such that a
mean time between failures of six weeks would be somewhere between
unacceptable and laughable for IT decision makers, notwithstanding the fact
that they have multiple incumbent systems whose support personnel would
dream of that kind of reliability.

Being pragmatic, sometimes it is necessary to work around a known bug rather
than find the root cause, but again my wit, such as it might be, has been
lost on this forum - we as a community relentlessly tell outside posters
that rather than work around a problem they should find the root cause and
fix it, and my intention was to point out this irony.

Unfortunately, I have some bizarre penchant for correctly functioning software
and the odd belief that on a finite state machine it ought to be possible.
Rest assured that this unrealized aspiration has caused both my worst
employees to resign and my best to excel - but that is the kind of test that
no PhD can prepare one for.

wrote in message news:xxxxx@ntdev…

In 1975, an MTBF of more than a day for anything other than a
multimilliondollar mainframe (with resident onsite vendo support) was
essentially unheard of. We had a homebuilt multiprocessor system which
used PDP-11 minicomputers. Today, I expect my systems to run for six
months without a reboot, but 37 years ago this would have been a minor
miracle. Our IBM maiframe crashed at least once a day, because the
University could not afford the service contract with the permanent onsite
service engineer.

There was an interesting lesson here: it doesn’t matter if the software
works if nobody notices it fails. I spent several months trying to find
the bug in a mass of twisted code I had inherited from a graduate student.
At the same time, we were trying to increase the MTBF from 45 minutes,
which included serious hardware debugging (we discovered that if a
rotate-register instruction was active when a DMA request came in, the
contents of the rotated register were scrambled. When we reported this to
Digital Equipment Corporation, there first response was “that must be why
the diagnostics fail” and their second response was “we aren’t going to
tell any of our existing customers because they’d want us to fix it”).

We never did find the reason for the heap corruption, but we always
suspected a double-bit memory error on parity-checked memory. My job was
to make the software work reliably under a variety of failure modes. No
matter what went wrong, I had to recover from it. I/O data errors,
protocol errors, errors from API calls, sender process termination,
receiver process termination, packet checksum errors, this was the central
message switcher process for the whole operating system. The original
code was a listing about a half-inch thick, and assumed that every API
would succeed, and any failure was an unrecoverable fatal error. When I
was done, the listing was four inches thick and ran nonstop. Since we wre
running in a 16-bit address space, heap fragmentation was a problem, so I
had a “heap compaction” algorithm that essentially re-created the heap
from scratch. So when I got the equivalent of 0xC0000005, I just
“compacted” the heap by calling the heap-compaction subroutine.

One of our most infamous problems was the failed-CALL instruction. It
pushed the return address onto the stack, then failed to do the IP
transfer so just executed the next instruction. I was impervious to these
because they inariably led to an instruction fault (usually the equivalent
of 0xC0000005) very shortly thereafter. If this ever happened, I’d
essentially restart the message multiplexor. Since this particular
failure was one-in-ten-to-the-moderately-large probability, we wouldn’t
hit it again for a while. If we saw too many hardware failures ina
processor, it was partitioned out of the pool, and set to running
diagnostics. If the diagnostics ran for N hours without error, the
processor was partitioned in as a “kernel-only” processor. It continued
to run diagnostics, but all the I/O devices on its bus were now available.
If the diagnostics ran for K more hours without a problem, it was
reinstated as a full processor and could run apps.

We had a similar treatment for memory. Pages with too many parity errors
were partitioned out on-the-fly but there was persistent partitioning
across reboots. By looking at the various logs, we could determine what
chip or set of chips was suspect, or even if it was in the bank selection
circuitry.

Many apps worked with strong state-machine protocols for robustness. And
one of our researchers came up with a non-locking *unsafe* algorithm for a
partial-differential-equation solver. The Problem was partitioned into a
4x4 matrix. It was a convolution algorithm, which meant each iteration
computed a value from the adjacent points. So at the boundaries, you
would expect to lock the adjacent partition to compute values that crossed
the partition boundary. Instead, he ignored locking entirely, using
obsolete values for computations. His insight was that since the
algorithm is convergent, using a value some other processor is in the
midst of recomputing did not maater; it just converged a little more
slowly. He got a PhD for this work. But it was based on the same
principle I used: it doesn’t have to be *right* as long as it remains
*correct*.
joe

MTBF of six weeks is not something to be especially proud of no matter how
cleaver your recovery algorithm was. As we are all fond of pointing out,
your time might have been better spent finding and fixing the real bug,
but
facetiousness aside, i wanted to point out that your assertion that a
software only solution can guarantee either an outgoing message or an
error
is patently false. There are some classes of hardware failure that are
either impossible to detect or impossible to recover from or both in
software and the obvious example that comes to mind is incorrect
instruction
execution by a CPU. Bus failures and memory failures are also in the
category of conditions not handled by normal hardware or things that the
ASIC / firmware must deal with or (hopefully) detect. What software can
do,
is to use persistent storage to recover to a consistent state. This
state
may not be the latest one that existed before the failure, but an
appropriately designed transactional system using stable media can recover
from a lot. Use of transmission protocols that include sequence or state
can make failures virtually transparent, but it takes this consistent
application of sound engineering principals on all levels of a system to
bring the MTBF from hours to years.

wrote in message news:xxxxx@ntdev…

> Hi Jospeh,
>
>>Key here is that you have to try all the accepted “good”
>>ways of asking a process to close; if they all fail, then you
>>can consider TerminateProcess.
>
> Never said something different,…
>
>>and putting a WM_QUIT message in the queue
>
> Putting this on a foreign process is for sure deadly, but not if you are
> prepared to handle this in your own app!
****
Actually, this is very hard to deal with. Suppose you need the message
pump during cleanup? The alternative is to carry around massive amounts
of state in global variables (or CWinApp-derived-class variables)
complicating the coding and maintenance. I’ve seen the results of this,
and they Aren’t Pretty.
*****
>
>>Note that anything that causes a long delay in message processing
>
> Thats why i am using threads, synchronisation and a signaling mechanism
> to
> make sure not to get false positives as less as possible.
****
Threads in your app? And what is a “signaling mechanism”? The choice of
implementation makes a huge difference in the robustness of the app suite
*****
>
>>Registered Window Message requsting emergency shutdown, which
>>is semi-isomorphic to WM_CLOSE, except it knows there is no user to
>>respond to confirmation
>
> If your application has a own mechanism to handle this its fine, but the
> application must be still responding or there will be no arrival of
> anykind of message once its message pump, threads or polling queque for
> commands is frozen! Here the Application Recovery and Restart (ARR)
> comes
> quite handy or another watchdog who can be of help e.g. a service, but
> there you have the session barries, etc,…
****
If the message pump freezes, it means the app is defective in design.
This is something you have to make sure cannot happen. Note that, as you
point out, you cannot use a service to do this. Therefore, there is
nothing to prevent the user from killing your watchdog app. Which is why
my system had mutual auto-restart.

Note that some of the mechanisms which purport to save a program’s state
during a restart either cannot do so, or cannot restore the state they
think they saved, or in fact are saving the very state that caused the
lockup in the first place, none of which are particularly good.
****
>
>>suites of programs that are robust and can make guarantees about
>> correctness.
>
> There is simply no :frowning:
****
Actually, unless someone uses task manager to kill one of my apps, we
guarantee correctness. I spent weeks making sure that worked. Once
TerminateProcess is in the picture, though, there can never be guarantees.
****
>
> i think we can infinitely debate on this “best practises” and “design
> concepts” topic and still find no 100% failsafe solution. Its impossible
> to handle all kind of failures in a application, because there are
> things
> that can be out of the scope of your app, e.g. the runtime, external
> situations, operating system failure, powerloss and and and. IMHO there
> is
> NO 100% safe way to make a application work 100% safe, even by using
> transational processes, there is always something that can fail. I am
> sure
> you know who Edward A. Murphy is :slight_smile:
****
Been there, done that, could even recover from memory parity errors. One
app reststarted itself about once a day when its heap got corrupted; we
never did find the cause of the heap corruption, but we could and did (I
did) recover from it. I will admit, I had not expected to get a memory
parity error reflected to the process by the kernel, but it did, and I
handled it correctly. The process promised “best attempt” delivery of
messages to its communicating processes, but if there was a failure, both
sender and receiver got a notification of the error. MTBF went from 45
minutes to six weeks (when a campus-wide power failure shut down
everything). I spent on the order of a year working on this project,
which was an OS-critical component which, when it failed, required a
complete reboot to fix. I could guarantee that either every packet got
through or there was a notification of failure.

Yes, if the power fails while the disk directory is being written, you
have potential problems, but a transacted file system (which we had)
catches those as well. So there is a way, and I’ve done it, and therefore
I don’t believe the assertion that it cannot be done.
****
>
> I would like to ask something about DuplicateHandle and handle leaks:
> what if I have a Process A and a Process B, and Process A accesses an
> object handle in Process B with DuplicateHandle and
> DUPLICATE_SAME_ACCESS, but Process A dies before Process B can close the
> handle to the Process A object handle. Am I right if I say that the
> kernel object is not freed unless Process B releases the handle to the
> object in dead Process A? Isn’t that a “sort” of handle leak? As long as
> there is a reference count > 0, the object and its memory stay alive,
> but the object does not belong to Process B. Or am I wrong, and the
> handle to the object goes down with Process A and the duplicate handle
> in Process B is invalid?
****
Duplicate handles will leak only if the process into which they have been
duplicated keeps running without closing them. ALL handles are closed
when a process terminates. The logic does not care what kind of handle
it is, original or duplicate; it is all maintained by reference counts in
the kernel. In fact, it is frequently the case that once process A
creates a duplicate handle in B that A will terminate. Now there is only
one handle left, the handle in B. When B terminates, if the handle has
not been closed, it will be forced closed. It is not a “sort” of handle
leak; it is not a leak at all. The system is behaving correctly. If B is
using the handle, the handle is validly in use, and must NOT be forced
closed just because A terminated. And the handle DOES belong to process
B. The internal logic does not care in the slightest that process A
*opened* the handle; what the kernel sees is two references to an object,
then one reference to that object. There is no way for it to tell WHICH
reference (original or duplicate) is outstanding, and, frankly, nobody
cares. All that matters is that there is a reference.

If you have a long-term process B, and A creates a duplicate handle in B’s
handle space, and B never closes it, then B is defective. Fix the bug.
There is no concept of “belonging to” at this level, just raw “reference
count”. A handle belongs to the process that can address it. As long as
any one handle remains in use, the objects managed by that handle remain
valid. Closing the “original” handle has ZERO impact on the validity of
any and all duplicates which may have been made of it.
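
A tiny sketch of that reference-count behavior (illustrative only, done
within a single process, because it is the count of handles that matters,
not which process holds them): after the "original" handle is closed, the
duplicate keeps the event object alive and fully usable.

/* Sketch: a duplicate is a full-fledged reference.  Closing the original
 * handle does not invalidate it; only when the LAST handle goes away is
 * the kernel object destroyed.                                           */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE original = CreateEventW(NULL, TRUE, FALSE, NULL);
    HANDLE duplicate = NULL;

    DuplicateHandle(GetCurrentProcess(), original,
                    GetCurrentProcess(), &duplicate,
                    0, FALSE, DUPLICATE_SAME_ACCESS);

    CloseHandle(original);          /* two refs -> one: object lives on   */

    SetEvent(duplicate);            /* still works through the duplicate  */
    DWORD w = WaitForSingleObject(duplicate, 0);
    printf("after closing the original: %s\n",
           w == WAIT_OBJECT_0 ? "still signaled, object alive"
                              : "unexpected");

    CloseHandle(duplicate);         /* one ref -> zero: object destroyed  */
    return 0;
}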

Consider the classic example: using an unnamed pipe to deal with a child
process writing to stdout/stderr. The protocol goes like this:

* Create an anonymous pipe, getting two handles, one for its input side
and one for its output side
* Create an inheritable duplicate of the input handle
* Place the handle value in the process creation block for stdout
* Create an inheritable duplicate of the input handle (a second one!)
* Place the second handle value in the process creation block for stderr
* CreateProcess, specifying that inheritable handles be inherited.
* Close the non-inheritable input handle
* Close the inheritable input handle
* Close the second inheritable input handle
* ReadFile from the output handle until you get ERROR_BROKEN_PIPE,
indicating the child process has closed the handle

Note that the non-inheritable and inheritable input handles must all be
closed so that the ONLY remaining valid handles are in the child process.
When the child process terminates, the result will be that those handles
are forced closed and the ERROR_BROKEN_PIPE will result. Failure to close
these handles (the most common error my students make, even though I give
them code examples on the slides) means the ReadFile will hang: the parent
still holds valid handles to the input side, so the pipe is never broken,
even though the child process has exited.

The reason for two duplicates is this: some apps do not use stderr, so the
first thing they do is close the stderr handle. Suppose the numeric value
of that handle was 12345 (just for discussion). If I put the same handle
value in for stdout and stderr, then when the child process closes stderr,
it closes handle 12345. The kernel does not care that this is also the
value for the stdout handle; it closes the handle. The caller sees broken
pipe; the child process will get an error writing to stdout. So you need
two duplicates. Upon completion of the second DuplicateHandle, there are
three handles to the input side: the original non-inheritable handle and
two inheritable handles. After CreateProcess, there are five handles to
the input side: the three aforementioned parent handles, and the two
handles for stdout and stderr in the child process. Now, let’s say the
child process closes stderr. Now there are four handles. The parent
closes the non-inheritable input handle and the two inheritable input
handles, leaving one valid handle, the child process’s stdout. When that
handle is closed, either by the child process issuing a CloseHandle or the
process just terminating (say, for an error), then there are zero handles
left, and the ReadFile gets the ERROR_BROKEN_PIPE condition. Note that
the parent process no longer possesses ANY handles to the input side of
the pipe, yet the pipe remains perfectly valid, because there is one valid
handle.
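
For completeness, here is a compact sketch of that protocol in C. The
child command line is just a placeholder and error handling is trimmed so
the shape stays visible; it is an illustration of the steps above, not
production code.

/* Sketch of the anonymous-pipe protocol described above.                 */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE readSide, writeSide, stdoutDup, stderrDup;

    /* Two non-inheritable handles: output (read) and input (write) side. */
    CreatePipe(&readSide, &writeSide, NULL, 0);

    /* Two inheritable duplicates of the input side, one each for stdout
     * and stderr, so the child closing one cannot tear down the other.   */
    DuplicateHandle(GetCurrentProcess(), writeSide, GetCurrentProcess(),
                    &stdoutDup, 0, TRUE, DUPLICATE_SAME_ACCESS);
    DuplicateHandle(GetCurrentProcess(), writeSide, GetCurrentProcess(),
                    &stderrDup, 0, TRUE, DUPLICATE_SAME_ACCESS);

    STARTUPINFOA si = { sizeof si };
    si.dwFlags    = STARTF_USESTDHANDLES;
    si.hStdInput  = GetStdHandle(STD_INPUT_HANDLE);
    si.hStdOutput = stdoutDup;
    si.hStdError  = stderrDup;

    PROCESS_INFORMATION pi;
    char cmd[] = "cmd /c dir";           /* placeholder child process     */
    CreateProcessA(NULL, cmd, NULL, NULL, TRUE /* inherit */, 0,
                   NULL, NULL, &si, &pi);

    /* Close ALL parent-side handles to the input side; the only remaining
     * references are now inside the child.                               */
    CloseHandle(writeSide);
    CloseHandle(stdoutDup);
    CloseHandle(stderrDup);

    /* Read until the child's last input-side handle goes away.           */
    char buf[512];
    DWORD got;
    while (ReadFile(readSide, buf, sizeof buf, &got, NULL) && got != 0)
        fwrite(buf, 1, got, stdout);
    if (GetLastError() == ERROR_BROKEN_PIPE)
        fprintf(stderr, "\n[pipe broken: child has no more output]\n");

    WaitForSingleObject(pi.hProcess, INFINITE);
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    CloseHandle(readSide);
    return 0;
}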

I find a lot of confusion about “ownership” issues. Processes do not own
other processes (unless you create a process group, which is a rare
occurrence), and the termination of one process has no consequences in
terms of the kernel forcing termination of any processes it created. The
concepts of “parent” and “child” exist for conversational purposes, but the
kernel does not care in the slightest. Similarly, processes own threads;
threads do not own threads, and termination of one thread has zero kernel
impact on the status of any threads it created (your program may hang
because it is waiting for something from that recently-deceased thread,
but that represents a bug in your program, not the kernel). Processes own
handles. They do not care if these are “original” handles from
CreateFile, or “duplicate” handles from DuplicateHandle. A handle is a
handle. FAILING to close a handle can have an impact on the correctness of
a program. The most common error I see is a student saying “It hangs on the
ReadFile call,” which is true because there is nothing to read; the
presence of valid handles to the input side means that there WILL be input
someday, except that the parent process is never going to write to those
handles. The kernel does not know or care; the code is simply incorrect as
written.

So don’t think that a duplicate handle is a second-class citizen, whose
existence depends on the continuing existence of the original handle. It
is a full-fledged handle, with all the rights thereunto, one of which is
to keep the file or device open until some data is read or written via
that handle. Only when ALL outstanding handles are closed will you get
IRP_MJ_CLOSE in your driver.
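As a reminder of where that lands on the driver side, here is a minimal,
hypothetical WDM stub, not anyone's actual driver: the I/O manager sends
IRP_MJ_CLEANUP when the last handle to the file object is closed, and
IRP_MJ_CLOSE only when the last reference to the file object itself goes
away.

/* Sketch: minimal create/close dispatch so the driver sees IRP_MJ_CLOSE
 * once every handle (original or duplicate, in every process) is gone.   */
#include <ntddk.h>

DRIVER_DISPATCH DispatchCreateClose;

NTSTATUS DispatchCreateClose(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    UNREFERENCED_PARAMETER(DeviceObject);
    Irp->IoStatus.Status = STATUS_SUCCESS;   /* nothing to tear down here */
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return STATUS_SUCCESS;
}

NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
{
    UNREFERENCED_PARAMETER(RegistryPath);
    DriverObject->MajorFunction[IRP_MJ_CREATE] = DispatchCreateClose;
    DriverObject->MajorFunction[IRP_MJ_CLOSE]  = DispatchCreateClose;
    return STATUS_SUCCESS;
}
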
joe
****



On 3/20/2012 2:27 AM, m wrote:

> I hope that you meant not writing properly - as I clearly can’t spell
> and in a news group have no compiler to help me with syntax errors
Unfortunately “reading properly” is quite correct.

OP said “MTBF went from 45 minutes to six weeks (WHEN A CAMPUS-WIDE
POWER FAILURE SHUT DOWN EVERYTHING)”.
[Capitalization by yours truly.]

=> An MTBF longer than six weeks could not be OBSERVED, due to
circumstances outside the control of the design.

[So if, say, you design a system for an MTBF of 24 months in a region of
the earth with low geological activity, but you are unlucky and an
earthquake cuts the main power supply and renders your battery backup
inoperable, is it still crappily designed? Come off it!]


Sorry for the OT post.

To provide redundancy and avoid power failures, the CMU campus was serviced
from two different substations, several miles apart, each of them fed from
a different power source.

During construction, a crane toppled into the valley next to the computer
building, and severed both power feeds, which at that point were three
feet apart.

Some years later, a backhoe digging a ditch near downtown Pittsburgh tore
up an 18" fiber bundle, and shut down all Internet traffic to the East
Coast. There were multiple paths, one via Cleveland and the other via
Boston, and both connections were through that same fiber bundle.

Says something about how to make sure there are no single-point failures…
joe


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer