Crash dump execution environment

We have our crash dump driver working, but I just want to confirm some of my
understanding of the execution environment.

When a crash dump miniport is processing crash dump writes (and any other
miniport requests):

  1. the processor is at IRQL HIGH_LEVEL

  2. all other processors are stopped (looping at HIGH_LEVEL?)

  3. devices on the PCI bus have not been turned off yet (via clearing
    the PCI command register or asking other drivers to stop or power down,
    unless they have explicitly asked for a crash dump callback and turned off
    the hardware)

  4. what is the state of the interrupt controller? There were some cases
    where spinlocks were acquired, lowering the IRQL to DISPATCH_LEVEL, which I
    believe is really bad if interrupts are pending; we are fixing this

  5. pretty much EVERY kernel API is bad (except a few like
    reading/writing ports/memory), including WMI/WPP tracing calls

  6. all threads except the one doing the crash dump are stopped
    (implied if all processors are locked at HIGH_LEVEL)

  7. all faults will be bad, so use of SEH is pointless and doesn’t work

  8. the crash dump ends with a processor reset? (but perhaps not a PCI
    bus reset, as our boot BIOS code sometimes dies after a crash dump and
    doesn’t after a normal shutdown??); this implies boot BIOS code should
    assume boot hardware is NOT freshly powered on/hardware bus reset and may be
    in an ugly state that needs resetting via software

  9. IRQL == HIGH_LEVEL is never encountered in normal operation, so can
    assume if at HIGH_LEVEL we must be crashing (this makes bypassing spinlocks
    much easier)

  10. can crash dumps happen when the original fault is at elevated IRQL or
    only PASSIVE_LEVEL? (DISPATCH or DIRQL or ???) This matters because if they
    never happen at elevated IRQL, our code can help protect the crash dump
    driver by raising the IRQL when changing data structures that might be used
    by the crash dump driver.

  11. crashing with no dump is much better than risking system disk
    corruption

  12. I’ve noticed 1394 debugging doesn’t exactly single step correctly
    while in a crash dump, there was recently a comment here that serial
    debugging was much better in that case?

  13. we felt that if we were going to boot using our storage device, not
    making crash dumps also work was pretty unacceptable

  14. does all this apply to writing hibernation data? Seems like there are
    hiber_xxx drivers

  15. We have done all this in Win 2003, what changes about crash dumps in
    Win 2008?

So what am I missing?

Jan

Regarding (9): In your miniport adapter driver’s DriverEntry routine:

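BOOLEAN IsDumpDriver = FALSE;

ULONG
DriverEntry(
    PVOID Argument1,
    PVOID Argument2)
{
    /* A NULL first argument means this instance was loaded as the dump driver. */
    if (Argument1 == NULL)
    {
        IsDumpDriver = TRUE;
    }

    /* ... the rest of DriverEntry is not shown in the original post ... */
}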

Why this is undocumented remains one of the Many Microsoft Mysteries.

Regarding (5): DbgPrintEx continues to work, although its use is
obscured in miniports.

64-bit dumping is slightly different from 32-bit dumping in how
processors are corralled, among other things. You need to test both.

(10) crash dumps can occur at elevated IRQL.


From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Jan Bottorff
Sent: Thursday, October 18, 2007 4:43 PM
To: Windows System Software Devs Interest List
Subject: [ntdev] Crash dump execution environment


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Jan Bottorff
Sent: Thursday, October 18, 2007 1:43 PM
To: Windows System Software Devs Interest List
Subject: [ntdev] Crash dump execution environment

  9. IRQL == HIGH_LEVEL is never encountered in normal operation, so can assume if at HIGH_LEVEL we must be crashing (this makes bypassing spinlocks much easier)

I suspect that relying on the current IRQL to tell you whether you can acquire a spinlock is going to screw you up. You know during DriverEntry whether you’re in crash dump mode, so you should do something there (like replacing your global pointer to KeAcquireSpinLock with a pointer to a function that does nothing).
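
A minimal user-mode sketch of that suggestion follows. It is only a model, not driver code: demo_lock_t, demo_driver_entry, demo_queue_request and the g_* globals are hypothetical names invented for illustration, and the lock is a plain struct standing in for a kernel spin lock. The one thing taken from the thread is the idea that a NULL first argument to DriverEntry marks dump mode, and that the lock routines are picked once at that point instead of by testing the IRQL later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel spin lock (a real driver would use KSPIN_LOCK). */
typedef struct {
    int held;           /* 1 while the lock is "owned" */
    int acquire_count;  /* how many times the real acquire routine ran */
} demo_lock_t;

typedef void (*lock_fn_t)(demo_lock_t *);

/* Normal-mode lock routines: these model the real acquire/release calls. */
static void real_acquire(demo_lock_t *lock)
{
    assert(lock != NULL);
    lock->held = 1;
    lock->acquire_count++;
}

static void real_release(demo_lock_t *lock)
{
    assert(lock != NULL);
    lock->held = 0;
}

/* Dump-mode replacement: deliberately does nothing. */
static void noop_lock(demo_lock_t *lock)
{
    (void)lock;
}

/* Global pointers that every caller goes through; chosen once at initialization. */
static lock_fn_t g_acquire = real_acquire;
static lock_fn_t g_release = real_release;
static bool g_is_dump_driver = false;

/* Models DriverEntry: a NULL first argument is treated as "loaded as dump driver". */
static void demo_driver_entry(void *argument1)
{
    g_is_dump_driver = (argument1 == NULL);
    g_acquire = g_is_dump_driver ? noop_lock : real_acquire;
    g_release = g_is_dump_driver ? noop_lock : real_release;
}

/* Models a queue routine that never inspects the IRQL or the mode itself. */
static int demo_queue_request(demo_lock_t *lock, int *queue_depth)
{
    g_acquire(lock);
    ++*queue_depth;
    g_release(lock);
    return *queue_depth;
}
```

Calling demo_driver_entry with a non-NULL pointer leaves the real lock routines in place; calling it with NULL swaps in the no-op, so demo_queue_request runs unchanged in both modes. Whether this can be applied to code living in a separate driver depends on that driver exposing such pointers, which the thread leaves open.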

  10. can crash dumps happen when the original fault is at elevated IRQL or only PASSIVE_LEVEL? (DISPATCH or DIRQL or ???) This matters because if they never happen at elevated IRQL, our code can help protect the crash dump driver by raising the IRQL when changing data structures that might be used by the crash dump driver.

At any IRQL.

  11. crashing with no dump is much better than risking system disk corruption

Probably. Of course both suck really bad.

  12. I’ve noticed 1394 debugging doesn’t exactly single step correctly while in a crash dump, there was recently a comment here that serial debugging was much better in that case?
  13. we felt that if we were going to boot using our storage device, not making crash dumps also work was pretty unacceptable
  14. does all this apply to writing hibernation data? Seems like there are hiber_xxx drivers

Hibernate and crashdump use a similar mechanism, so if you can figure out how to get loaded as hiber_xxx then your modifications to your driver may well work.

  15. We have done all this in Win 2003, what changes about crash dumps in Win 2008?

So what am I missing?

Jan



Hi,

I do know in MY driver when I’m crashing. One of the reasons I needed to
pass some data from the normal instance to the dump instance is that the
routines that know how to put entries in the hardware I/O queue live in a
different driver, and I need to call them (via a function pointer) with their
context info. One of the things they do is acquire a spinlock before
manipulating the hardware queue.

So what would happen if, whenever I was calling those hardware queue
routines, I raised the IRQL to HIGH_LEVEL if it wasn’t already, and put it
back afterwards? Telling that other piece of code to set a global flag to
bypass spinlocks may require changes to the API for a third-party protocol
stack.

Or could I depend on the IRQL being > DISPATCH_LEVEL? Or is there a flag,
which can be linked to via a system DLL, that can tell the queue routines if
we are crashing?

It sounds like if the crash driver is called at < HIGH_LEVEL then interrupts
are masked off?

Jan


From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Peter Wieland
Sent: Thursday, October 18, 2007 2:11 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Crash dump execution environment


> 11) crashing with no dump is much better than risking system disk corruption

Indeed, it is…

There is one more point to consider: you don’t always get a chance to dump crash data, because some errors reset the CPU right on the spot, before you can dump memory or even display a “Blue Screen Of Death”. Writing an invalid value to CR0 is the first example of such an error that comes to my mind, but it is far from the only one. As far as I remember from my experiments, there are quite a few ways to reset the CPU right on the spot. Therefore, you should think of your task as a “best-effort service” that cannot be guaranteed…

Anton Bassov