Crash dump execution environment

OSR_Community_User · October 18, 2007, 4:43pm

We have our crash dump driver working, but just want to confirm some of my
understanding about the execution environment.

When a crash dump miniport is processing crash dump writes (and any other
miniport requests):

1) the processor is at IRQL HIGH_LEVEL
2) all other processors are stopped (looping at HIGH_LEVEL?)
3) devices on the PCI bus have not been turned off yet (via clearing
the PCI command register or asking other drivers to stop or power down,
unless they have explicitly asked for a crash dump callback and turned off
the hardware)
4) what is the state of the interrupt controller? There were some cases
where spinlocks were acquired, lowering the IRQL to DISPATCH_LEVEL, which I
believe is really bad if interrupts are pending. We are fixing this.
5) pretty much EVERY kernel API is bad (except a few like
reading/writing ports/memory), including WMI/WPP tracing calls
6) all threads except the one doing the crash dump are stopped
(implied if all processors are locked at HIGH_LEVEL)
7) all faults will be bad, so use of SEH is pointless and doesn’t work
8) the crash dump ends with a processor reset? (but perhaps not a PCI
bus reset, as our boot BIOS code sometimes dies after a crash dump and
doesn’t after a normal shutdown??) This implies boot BIOS code should
assume boot hardware is NOT freshly powered on or hardware bus reset, and
may be in an ugly state that needs resetting via software
9) IRQL == HIGH_LEVEL is never encountered in normal operation, so we can
assume that if we are at HIGH_LEVEL we must be crashing (this makes
bypassing spinlocks much easier)
10) can crash dumps happen when the original fault is at elevated IRQL or
only PASSIVE_LEVEL? (DISPATCH or DIRQL or ???) This matters because if they
never happen at elevated IRQL, our code can help protect the crash dump
driver by raising the IRQL when changing data structures that might be used
by the crash dump driver.
11) crashing with no dump is much better than risking system disk
corruption
12) I’ve noticed 1394 debugging doesn’t exactly single step correctly
while in a crash dump. There was recently a comment here that serial
debugging was much better in that case?
13) we felt that if we were going to boot using our storage device, not
making crash dumps also work was pretty unacceptable
14) does all this apply to writing hibernation data? Seems like there are
hiber_xxx drivers
15) We have done all this in Win 2003. What changes about crash dumps in
Win 2008?

So what am I missing?

Jan

OSR_Community_User · October 18, 2007, 5:11pm

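Regarding (9): In your miniport adapter driver’s DriverEntry routine:

    BOOLEAN IsDumpDriver = FALSE;

    ULONG
    DriverEntry(
        PVOID Argument1,
        PVOID Argument2)
    {
        /* Argument1 is NULL when the miniport is loaded as the dump driver. */
        if (Argument1 == NULL)
        {
            IsDumpDriver = TRUE;
        }

        /* ... rest of DriverEntry ... */
    }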

Why this is undocumented remains one of the Many Microsoft Mysteries.

Regarding (5): DbgPrintEx continues to work, although its use is
obscured in miniports.

64-bit dumping is slightly different from 32-bit dumping in how
processors are corralled, etc. You need to test both.

(10) crash dumps can occur at elevated IRQL.

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Jan Bottorff
Sent: Thursday, October 18, 2007 4:43 PM
To: Windows System Software Devs Interest List
Subject: [ntdev] Crash dump execution environment

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

OSR_Community_User · October 18, 2007, 5:12pm

> 9) IRQL == HIGH_LEVEL is never encountered in normal operation, so can assume if at HIGH_LEVEL we must be crashing (this makes bypassing spinlocks much easier)

I suspect that relying on the current IRQL to tell you whether you can acquire a spinlock is going to screw you up. You know during driver entry if you’re in crashdump mode, so you should do something there (like pointing your global KeAcquireSpinLock function pointer at a function that does nothing).
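
A minimal sketch of the function-pointer swap suggested above, written as plain user-mode C so it compiles outside the kernel. SIM_LOCK, the Real*/NoOp* routines, SelectLockRoutines and QueueRequest are all hypothetical stand-ins, not WDK names; in a real driver the "real" routines would wrap the kernel spinlock calls. It assumes the Argument1 == NULL check from the earlier reply is how dump mode is detected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-mode stand-in for KSPIN_LOCK. */
typedef struct { int held; } SIM_LOCK;

typedef void (*ACQUIRE_FN)(SIM_LOCK *lock);
typedef void (*RELEASE_FN)(SIM_LOCK *lock);

/* "Real" routines: stand-ins for wrappers around the kernel spinlock calls. */
static void RealAcquire(SIM_LOCK *lock) { assert(!lock->held); lock->held = 1; }
static void RealRelease(SIM_LOCK *lock) { assert(lock->held);  lock->held = 0; }

/* Dump-mode routines: deliberately do nothing. */
static void NoOpAcquire(SIM_LOCK *lock) { (void)lock; }
static void NoOpRelease(SIM_LOCK *lock) { (void)lock; }

/* Global pointers that the rest of the driver calls through. */
static ACQUIRE_FN g_Acquire = RealAcquire;
static RELEASE_FN g_Release = RealRelease;

/* Called once from the entry routine. A NULL first argument is taken
 * as the dump-mode signal (assumption carried over from the thread). */
void SelectLockRoutines(const void *argument1)
{
    bool isDump = (argument1 == NULL);
    g_Acquire = isDump ? NoOpAcquire : RealAcquire;
    g_Release = isDump ? NoOpRelease : RealRelease;
}

/* Example caller: reports whether the lock was actually held
 * while the (imaginary) hardware queue was being touched. */
int QueueRequest(SIM_LOCK *lock)
{
    int observed;
    g_Acquire(lock);
    observed = lock->held;
    g_Release(lock);
    return observed;
}
```

The decision is made once at load time, so the hot path never has to inspect the IRQL or a flag; callers only ever go through g_Acquire and g_Release.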

> 10) can crash dumps happen when the original fault is at elevated IRQL or only PASSIVE_LEVEL? (DISPATCH or DIRQL or ???) This matters because if they never happen at elevated IRQL, our code can help protect the crash dump driver by raising the IRQL when changing data structures that might be used by the crash dump driver.

At any IRQL.

> 11) crashing with no dump is much better than risking system disk corruption

Probably. Of course both suck really bad.

> 14) does all this apply to writing hibernation data? Seems like there are hiber_xxx drivers

Hibernate and crashdump use a similar mechanism, so if you can figure out how to get loaded as hiber_xxx then your modifications to your driver may well work.

OSR_Community_User · October 18, 2007, 6:04pm

Hi,

I do know in MY driver when I’m crashing. One of the reasons I needed to
pass some data from the normal instance to the dump instance is that the
routines that know how to put entries in the hardware I/O queue live in a
different driver, and I need to call them (via a function pointer) with
their context info. One of the things they do is acquire a spinlock before
manipulating the hardware queue.

So what would happen if, whenever I was calling those hardware queue
routines, I raised the IRQL to HIGH_LEVEL if it wasn’t already there, and
put it back afterwards? Telling that other piece of code to set a global
flag to bypass spinlocks may require changes to the API for a third-party
protocol stack.

Or could I depend on the IRQL being > DISPATCH_LEVEL? Or is there a flag,
linkable via a system DLL, that can tell the queue routines we are
crashing?

It sounds like if the crash driver is called at < HIGH_LEVEL then interrupts
are masked off?

Jan
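
For concreteness, the wrapper asked about above might look roughly like the following. This is only a user-mode simulation under stated assumptions: the SIM_* names, g_CurrentIrql and both routines are hypothetical stand-ins for the kernel's IRQL machinery and for the third-party queue routine, and the numeric levels matter only for their ordering. Whether inferring dump mode from the IRQL is safe in a real driver is exactly the open question here, and the earlier reply cautions against it.

```c
#include <assert.h>

/* Hypothetical stand-ins for KIRQL and the IRQL constants. */
typedef unsigned char SIM_IRQL;
enum { SIM_PASSIVE_LEVEL = 0, SIM_DISPATCH_LEVEL = 2, SIM_HIGH_LEVEL = 31 };

static SIM_IRQL g_CurrentIrql = SIM_PASSIVE_LEVEL;

SIM_IRQL SimGetCurrentIrql(void) { return g_CurrentIrql; }

/* Raise the simulated IRQL and hand back the previous level. */
static SIM_IRQL SimRaiseIrql(SIM_IRQL newIrql)
{
    SIM_IRQL old = g_CurrentIrql;
    assert(newIrql >= old);          /* raising must never lower the level */
    g_CurrentIrql = newIrql;
    return old;
}

/* Restore a previously saved level. */
static void SimLowerIrql(SIM_IRQL old)
{
    assert(old <= g_CurrentIrql);
    g_CurrentIrql = old;
}

typedef int (*QUEUE_FN)(void *context);

/* Stand-in for the other driver's queue routine: it takes its
 * spinlock only when the level is DISPATCH_LEVEL or below, and
 * records that decision in *context for the caller to inspect. */
int SimQueueRoutine(void *context)
{
    int *lockTaken = context;
    *lockTaken = (g_CurrentIrql <= SIM_DISPATCH_LEVEL);
    return 0;
}

/* The wrapper in question: raise to HIGH_LEVEL around the call,
 * then put the previous level back. */
int CallQueueAtHighLevel(QUEUE_FN fn, void *context)
{
    SIM_IRQL old = SimRaiseIrql(SIM_HIGH_LEVEL);
    int status = fn(context);
    SimLowerIrql(old);
    return status;
}
```

Called through the wrapper, the queue routine sees a level above DISPATCH_LEVEL and skips its lock; called directly at a low level, it takes the lock as usual.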

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Peter Wieland
Sent: Thursday, October 18, 2007 2:11 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Crash dump execution environment

anton_bassov · October 19, 2007, 12:35am

> 11) crashing with no dump is much better than risking system disk corruption

Indeed, it is…

There is one more point to consider: you don’t always get a chance to dump crash data, because some errors may reset the CPU right on the spot, before memory can be dumped or even a “Blue Screen Of Death” displayed. Writing an invalid value to CR0 is the first example of such an error that comes to my mind, but it is far from being the only one. As far as I remember from my experiments, there are quite a few ways to reset the CPU right on the spot. Therefore, you should think of your task as a “best-effort service” that cannot be guaranteed…

Anton Bassov