Windows 8 bugcheck 0x9F when sleep, remove hardware, wake

Hi all,

I have a problem where the following sequence of events results in bugcheck 0x9F (DRIVER_POWER_STATE_FAILURE) in Windows 8 (or 8.1):

  1. With our PCI Express card present, install our WDM driver.
  2. Put the system to sleep.
  3. Remove our PCI Express card.
  4. Wake the system using the power button.

As the system begins to power up, our driver goes through the usual motions of handling the power management IRPs. We see that the device IRP_MN_SET_POWER is failed with STATUS_NO_SUCH_DEVICE, so we don’t restore state to the hardware.

As soon as the driver completes the system IRP_MN_SET_POWER IRP, it gets a sequence of IRPs culminating in IRP_MN_SURPRISE_REMOVAL and IRP_MN_REMOVE_DEVICE. All of these are completed OK, as far as I can tell, and driver verifier does not complain about anything that our driver does. We avoid touching our hardware after we see IRP_MN_SURPRISE_REMOVAL.
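
For reference, the relevant part of our PnP handling boils down to something like the following (a simplified sketch, not our exact code; HardwareGone and LowerDevice are placeholder names in our device extension, and I/O rundown is omitted):

    NTSTATUS
    PnpDispatch(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PDEVICE_EXTENSION devExt = DeviceObject->DeviceExtension;
        PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
        NTSTATUS status;

        switch (stack->MinorFunction) {
        case IRP_MN_SURPRISE_REMOVAL:
            /* Flag checked before every register access from here on. */
            devExt->HardwareGone = TRUE;
            Irp->IoStatus.Status = STATUS_SUCCESS;
            IoSkipCurrentIrpStackLocation(Irp);
            return IoCallDriver(devExt->LowerDevice, Irp);

        case IRP_MN_REMOVE_DEVICE:
            /* Free software resources only; the card is already gone. */
            Irp->IoStatus.Status = STATUS_SUCCESS;
            IoSkipCurrentIrpStackLocation(Irp);
            status = IoCallDriver(devExt->LowerDevice, Irp);
            IoDetachDevice(devExt->LowerDevice);
            IoDeleteDevice(DeviceObject);
            return status;

        default:
            IoSkipCurrentIrpStackLocation(Irp);
            return IoCallDriver(devExt->LowerDevice, Irp);
        }
    }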

However, not all is well. Typically, the Windows 8 login screen appears, but the system is frozen. A short time later, bugcheck 0x9F occurs, which usually names pci.sys or acpi.sys as the problem driver, though it is likely that in reality the problem is our driver. The IRP from the bugcheck args is not one that was sent to our driver.

So I guess my questions are:

  1. Can anybody give me any tips on how to debug this, given that neither the bugcheck args nor Driver Verifier provides any hints as to the underlying cause?

  2. Is it possible that this could be a BIOS bug, or might there be some other driver doing naughty things and poking around in PCIe devices that it shouldn’t touch? For instance, nVidia are known for questionable practices such as detecting hardware from competitors, though the video card in the system in question is actually from AMD.

Thanks in advance for any ideas.
Tom

Does your BIOS support hot remove? If not, you are doing something unsupported. You can’t yank a PCI device in low power unless the BIOS can handle it.

d

Bent from my phone



Doran, that’s a good point, thank you.

The system in question has a normal PC motherboard, so I probably am doing something illegal, because sleep might not power down PCIe completely. I guess I don’t need to bother investigating further for the specific case of sleep, remove, wake.

However, I was originally investigating a problem with an almost identical crash when doing the following:

  1. With our PCI Express card present, install our WDM driver.
  2. Do a hybrid shutdown, which looks like hibernation, as far as the driver is concerned.
  3. Remove our PCI Express card.
  4. Power-on the system using the power button, so that Windows does a fast boot.

Is the case of “hybrid shutdown, remove, fast boot” also likely to be unsupported?

I don’t see the IRP_MN_SURPRISE_REMOVAL or IRP_MN_REMOVE_DEVICE messages in the kernel debugger with this case, but I suspect that it may be because the kernel debugger connection (using 1394) gets “refreshed” and loses some messages.

When you see the power IRP coming back from PCI.SYS with an error status, do you propagate the power IRP completion up? Check whether you end up forgetting to complete the IRP. Typically, you’d stall the completion (STATUS_MORE_PROCESSING_REQUIRED) to perform your own power-up, and then call IoCompleteRequest. If you don’t power up your device, do you still call IoCompleteRequest?
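
Something along these lines, just a rough sketch of the pattern with placeholder names (PendingD0Irp, PowerUpWorkItem, RestoreHardwareState, HardwareGone), assuming your dispatch routine marked the IRP pending, stashed it in the device extension, set this completion routine and sent the IRP down with PoCallDriver, and that PowerUpWorkItem came from IoAllocateWorkItem:

    /* Forward declaration of the work routine used below. */
    IO_WORKITEM_ROUTINE PowerUpWorkRoutine;

    NTSTATUS
    D0IrpCompletion(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
    {
        PDEVICE_EXTENSION devExt = Context;

        /* Stall completion and drop to PASSIVE_LEVEL; the work item below
           always finishes the IRP, whether or not the bus driver succeeded. */
        IoQueueWorkItem(devExt->PowerUpWorkItem, PowerUpWorkRoutine,
                        DelayedWorkQueue, devExt);
        return STATUS_MORE_PROCESSING_REQUIRED;
    }

    VOID
    PowerUpWorkRoutine(PDEVICE_OBJECT DeviceObject, PVOID Context)
    {
        PDEVICE_EXTENSION devExt = Context;
        PIRP irp = devExt->PendingD0Irp;

        if (NT_SUCCESS(irp->IoStatus.Status)) {
            RestoreHardwareState(devExt);     /* your normal power-up work */
        } else {
            /* e.g. STATUS_NO_SUCH_DEVICE: don't touch the hardware,
               but the stalled IRP still has to be completed. */
            devExt->HardwareGone = TRUE;
        }

        PoStartNextPowerIrp(irp);             /* harmless no-op on Vista and later */
        IoCompleteRequest(irp, IO_NO_INCREMENT);
    }

The point is that IoCompleteRequest gets called on both paths; if the failure path skips it, the power IRP (and everything waiting behind it) hangs, and eventually you get the 0x9F.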

Alex, as far as I know, we are completing the IRPs correctly. Driver verifier is happy. Powering up and down works correctly when we don’t remove our card.

What seems to happen is that something else, pci.sys or acpi.sys, fails to complete a power IRP (one that has nothing to do with our device) when waking up after pulling out our PCIe device. But as Doran has pointed out, it’s probably illegal (and even dangerous) to remove a PCI Express card from a normal non-hot-swap motherboard while the system is sleeping. So I think I do not need to investigate that particular case further.

However, when doing a very similar thing, where we hybrid shutdown instead of sleeping, we get essentially the same symptoms (bugcheck 0x9F on a power IRP). So I am now wondering if “hybrid shutdown -> remove hardware -> fast boot” is legal.

The 0x9F bugcheck arguments will show you the type of problem and the hanging IRP (if it’s an IRP timeout). From those you can dig out further information, either from a crashdump or in a live debug session.
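
For the common case where parameter 1 is 3 (a device object has been blocking an IRP for too long), parameter 2 is the PDO of the stack that owns the IRP and parameter 4 is the blocked IRP itself, so with the dump loaded in WinDbg, something like:

    !analyze -v               (decodes the 0x9F parameters for you)
    !irp <parameter 4>        (the blocked power IRP: which stack location is still pending, and for which driver)
    !devstack <parameter 2>   (the PDO/filter/FDO stack that the IRP is stuck in)
    !poaction                 (the outstanding system power action)

That usually tells you which device stack is actually sitting on the power IRP, even if the bugcheck itself blames pci.sys or acpi.sys.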

Also, when you unplug a device, the device IDs and BAR addresses of other devices may change after the next boot. This may trigger surprise removal and re-enumeration of those other PCIe devices, and some of their drivers may not handle that well.

I suppose it’s possible that, since this looks very similar to a hibernate / remove hardware / power-on cycle from the drivers’ point of view, the BIOS might move certain devices to different PCI addresses (i.e. shifting down devices that were previously enumerated after our now-removed PCIe card).

I guess that might result in behaviour such as IRP_MN_STOP_DEVICE and IRP_MN_START_DEVICE arriving either right before or right after a bunch of power IRPs, which could trip an untested corner case in some other driver.

I think I need to find out for sure if (hibernate / hybrid shutdown) -> remove hardware -> (power-on / fast boot) is supported by Windows for PCI/PCIe devices. I came across a forum post by Elias Yakub saying that it is, but maybe it’s a case that isn’t well tested in some drivers.

How could Windows as an OS not support it? It’s not anything the OS has much say over.

What’s weird/difficult/annoying is that you get the power on before the remove notification.

Want to know FUN? Hybrid shutdown and then move your device to another slot. Then power on.

You get an add device and start for the “new” device (that is, your device in the new slot) before you get the remove for the device from the old slot. Arrgh.

Peter
OSR
@OSRDrivers

I had assumed that it should be supported too, given that hibernation guarantees putting PCIe into the D3 state (I think?), but you know, sometimes my assumptions turn out to be incorrect. Doesn’t hurt to ask…

I noticed the power IRPs coming before the PnP IRPs. It’s not exactly an intuitive order to get them, but fortunately the set device power-up IRP gets a failure status (from the PDO?), so at least the FDO can know that it shouldn’t try to touch the hardware.

Ok, thanks for all the replies and comments, everyone - I appreciate the help. I guess hibernate + remove PCIe device + wake is a legal sequence, and I need to test this scenario on a bunch of different systems to see if there are any differences in behaviour.

Agreed. I wasn’t trying to be critical, so my apologies if you took it that way. I was trying to make a broader architectural point that didn’t really need to be made, and I didn’t even do a good job in trying to make that point.

I predict that driver devs are going to be seeing a LOT of weird, unexpected failures due to the way “hybrid shutdown” works. We wrote a driver last year for a piece of hardware in which either ONE device (which was the master) or TWO devices (one master and one backup) were supported… but NOT more than two. Our test team just LOVED hybrid shutting down the system, moving the device from one slot to another, and restarting the machine: “But there are ONLY TWO DEVICES INSTALLED,” the bug report read. “Customers will expect this to work.”

Sigh. I had to fix it. What a PITA.

Peter
OSR
@OSRDrivers

>You get an add device and start for the “new” device (that is, your device in the new slot) before you get the remove for the device from the old slot. Arrgh.

But you’ll have SURPRISE_REMOVAL before the new device is enumerated.

Are you quite sure? This was a KMDF driver, and I wasn’t handling surprise remove specifically, but I WAS NOT called to release my resources until AFTER the new device was added. Assuming I recall this correctly.

Peter
OSR
@OSRDrivers