Actually, it is possible to do this transparently, but it is a lot more work
than the approach that Gregory outlines below. The idea is to checkpoint
around I/O output operations, with each checkpoint consisting of the entire
set of memory writes made since the previous checkpoint.
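
To make the idea concrete, here is a minimal toy sketch in Python. It is not how any real system (Sequoia's or otherwise) did it: "memory" is modeled as a dict, and the class and method names are made up for illustration. The only point is that a checkpoint is just the set of writes since the previous checkpoint, and that it is committed before output becomes externally visible.

```python
class CheckpointedMemory:
    """Toy model (hypothetical): a checkpoint is the set of writes since the last one."""

    def __init__(self):
        self.primary = {}   # live state of the running process
        self.backup = {}    # state as of the last checkpoint
        self.dirty = {}     # writes made since the last checkpoint
        self.outputs = []   # stands in for externally visible I/O

    def write(self, addr, value):
        # Every write is applied to live state and logged as dirty.
        self.primary[addr] = value
        self.dirty[addr] = value

    def checkpoint(self):
        # Ship only the writes made since the previous checkpoint.
        self.backup.update(self.dirty)
        self.dirty.clear()

    def output(self, data):
        # Checkpoint first, so the backup is never behind what the
        # outside world has already seen.
        self.checkpoint()
        self.outputs.append(data)

    def rollback(self):
        # Recovery: discard everything written since the last checkpoint.
        self.primary = dict(self.backup)
        self.dirty.clear()
```

A real implementation would have to trap writes at the page or hardware level rather than through a method call, which is where most of the extra work comes from.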
There are of course other minor details that one has to work out for process
mirroring, stupid stuff like timestamps, handles, etc. that may very well be
different between process images.
Whether you actually repeat the crash condition or not depends on a lot of
variables. The experience we had at Sequoia Systems, where we built
commercial fault-tolerant systems that used a checkpoint and rollback
scheme, was that there were a significant number of defects that, due to
timing or other issues, were not repeated on recovery. Also, software
defects are obviously not the only reason for system failure.
-----Original Message-----
From: Gregory G. Dyess [mailto:xxxxx@pdq.net]
Sent: Monday, October 22, 2001 8:27 AM
To: NT Developers Interest List
Subject: [ntdev] RE: Process Mirroring
Unfortunately, you probably also saved the condition that caused the program
to crash in the first place. More than likely, this will lead to repeated
crashes until you clear the current state. The far superior (and only
really workable) solution is to have cooperating but independent processes
passing checkpoint information to the other process in a semi-lock-step
approach. Also, each “Mirror Process” must watchdog the other one to know
when to switch roles. This isn’t always easy to get right and takes
experience in fault-tolerant design. The entire notion of “not modifying an
existing program” and still having the level of fault tolerance you desire
is self-contradictory. Sorry, but you have to design this in. It cannot be
globbed on externally.
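
As a rough illustration of the cooperating-process arrangement described above, here is a toy Python sketch. It assumes two in-memory objects standing in for the two processes, and the names (`Mirror`, `send_checkpoint`, `watchdog`) are hypothetical. Time is passed in explicitly to keep the example deterministic; a real design would use IPC and real clocks.

```python
class Mirror:
    """Toy stand-in (hypothetical) for one of two cooperating mirror processes."""

    def __init__(self, name, role):
        self.name = name
        self.role = role            # "primary" or "standby"
        self.state = {}             # application state built from checkpoints
        self.last_heartbeat = 0.0   # when the peer was last heard from

    def send_checkpoint(self, peer, updates, now):
        # The primary applies its updates locally, then forwards them.
        # The checkpoint message doubles as a heartbeat.
        self.state.update(updates)
        peer.receive_checkpoint(updates, now)

    def receive_checkpoint(self, updates, now):
        self.state.update(updates)
        self.last_heartbeat = now

    def watchdog(self, now, timeout):
        # The standby takes over if the primary has been silent too long.
        if self.role == "standby" and now - self.last_heartbeat > timeout:
            self.role = "primary"
        return self.role
```

Note that the application decides what goes into `updates`, which is why this approach has to be designed into the program rather than added from outside.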
Greg
-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com]On Behalf Of sajeev sas
Sent: Monday, October 22, 2001 12:27 AM
To: NT Developers Interest List
Subject: [ntdev] RE: Process Mirroring
Hi Mark,
Though it’s my application, one of the requirements is
to avoid changing the code so as to make it as generic
as possible.
Deviating from my old approach: instead of always
running two copies of the same process, another
approach is to inject a DLL into the target process
and, using a timer, save the context plus other info in a
shared buffer. This can be used to recreate the
process if it crashes. In this approach only one copy
of the process runs at a time, and context
saving doesn’t happen for every instruction.
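
The timer-driven save could look roughly like the Python toy below. This is only a sketch of the idea, not an injected DLL: a plain bytes object stands in for the shared buffer, `get_state` is a hypothetical callback returning whatever would be saved, and real thread contexts and handles cannot be captured this simply.

```python
import pickle
import threading


class Snapshotter:
    """Toy sketch (hypothetical): periodically save state into a buffer."""

    def __init__(self, get_state, interval):
        self.get_state = get_state   # callback returning the state to save
        self.interval = interval     # seconds between snapshots
        self.buffer = None           # stands in for the shared buffer
        self._timer = None

    def save(self):
        # Serialize a copy of the current state into the buffer.
        self.buffer = pickle.dumps(self.get_state())

    def _tick(self):
        self.save()
        self.start()                 # re-arm the timer for the next snapshot

    def start(self):
        self._timer = threading.Timer(self.interval, self._tick)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()

    def restore(self):
        # Recreate state from the last snapshot, if one was taken.
        return pickle.loads(self.buffer) if self.buffer is not None else None
```

Anything that happens between the last snapshot and the crash is lost on restore, and whatever the snapshot contained, good or bad, comes back with it.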
Thanks,
Sajeev.