On Supporting SURPRISE_REMOVAL and STOP in driver

Hi!

I am writing a Win8 device driver. This driver installs against an ACPI device (and does not use any hardware resources).

Now, since we do not use any hardware resources, being entirely a “soft” device, should I support SURPRISE_REMOVAL and STOP in the driver? Or would it be okay if I were to fail the QueryStop and QuerySurpriseRemove PNP power callbacks?

Best regards
Sharma

It’s way better to allow stop, as people WILL want to update your driver without rebooting.

Jan

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Um, there is no query surprise remove. Query stop is for resource rebalancing, not driver unload. For servicing you need to support query remove. If this is a KMDF driver, these pretty much come for free already. Is this a SoC driver?

d

debt from my phone


@Jan:

Reinstall/update doesn’t use STOP_DEVICE. It goes straight to Remove (with query).
In Win9x, disabling the device in Device Manager used to issue STOP.

Thank you, Jan and Doron.

Doron: my bad, not QuerySurpriseRemove, just QueryRemove. Thank you for pointing that out. This is indeed a KMDF driver.

I was actually trying to understand if there could ever be a rebalance of resources for a driver that does not actually manage any physical devices. What is the correct/recommended behaviour of a non-core WDF device driver when it receives the following IRPs: QUERY_STOP/STOP and SURPRISE_REMOVE? Especially when there is no device to remove?

Thank you and Best regards
Sharma

It makes no sense to actively block stop. In a KMDF driver, remove, stop, and power down are all the same code paths. So from that perspective, it is all tested and executed already.

d

debt from my phone


Doron,

Sorry for any thread drift, but I have two related questions that I’ve
wondered about for a while…

Is there a *Real World* scenario where a PCI device soldered down on the
system board will ever be surprise removed?

Is there a *Real World* scenario where a PCI device soldered down on the
system board, and with fixed resource assignments (declared in ACPI and set by
BIOS during boot), will ever be stopped for rebalance?

KMDF handles these situations (Thank You!), so it’s not a problem. I’m just
curious if these things ever really happen outside of WHCK.

–John

Yes, surprise removal (SR) can also happen due to software failure. If you report the device state as failed, you get an SR. A filter driver can do this too. I'm not sure about stop, but while Windows tries really hard to keep the BIOS-assigned hardware settings, I think they can be changed. I'm also not sure how a resource rebalance further up the tree would affect this if a new device showed up upstream via hotplug. Jake would know more definitively on both counts.

d

debt from my phone


I was once in a heated finger-pointing meeting where a batch of DC regulators
from a vendor would fail prematurely in the field, making the PCI device
disappear. Man, you don’t want to be in any kind of
meeting like that.

Calvin

And I managed to convince the angry OEM that my driver survived the SR.

To pass the WHQL tests you need to handle surprise removal. Devices on the motherboard suggest passing the WHQL tests for a whole system.

I do understand the desire to avoid handling surprise removal, it can be hard.

I guess I’m one of those people who thinks drivers should try hard to not crash the OS when hardware failures happen. I know some developers see the hardware through rose colored glasses, and don’t feel you need to cope with malfunctioning hardware. I’ve heard the “my driver never needs to cope because…” reasoning before, and seen hardware do things it’s not supposed to do.

If your hardware is firmware controlled, which a LOT of modern hardware is, then that firmware will occasionally crash, and your interface to the device suddenly becomes undefined. For example, say your hardware has a ring index register in its BAR window that you read and then use as an index into an in-memory structure. You should notice if you read all 0xFF and, if so, declare your hardware failed rather than use all 0xFF as the index into your ring. A hardware design that DMAs the ring index into memory is safer, because if the hardware goes away you can still access RAM, likely with its previous values. Writing to a BAR register that is gone tends to be less dangerous. Hardware registers that have all 0xFF as valid read data are really problematic. I know the PCIe 2.0 spec did not spell out what values would be read during a master/target abort, although it did say it would be the same as the PCI spec, which did say you get all 0xFFs. The motherboard chipsets that control the root complex also usually did spell out reading 0xFFs on target/master abort.
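
A minimal sketch of the all-ones check described above, in plain C. It assumes a 32-bit ring index register; the names validate_ring_index, ring_index_status and REG_ALL_ONES are hypothetical and not part of any Windows or KMDF API. In a real driver the raw value would come from a register read, and the "device gone" result would feed whatever failure reporting the driver uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A read from a PCI device that has gone away typically returns all ones. */
#define REG_ALL_ONES 0xFFFFFFFFu

/* Hypothetical result type for this sketch. */
typedef enum {
    RING_INDEX_OK,
    RING_INDEX_DEVICE_GONE,   /* read back all 0xFF: treat hardware as failed */
    RING_INDEX_OUT_OF_RANGE   /* device answered, but the value is bogus */
} ring_index_status;

/*
 * Validate a raw value read from a (hypothetical) ring index register
 * before it is used to index an in-memory ring of ring_size entries.
 * *index_out is written only when the result is RING_INDEX_OK.
 */
static ring_index_status
validate_ring_index(uint32_t raw, uint32_t ring_size, uint32_t *index_out)
{
    assert(ring_size > 0 && index_out != NULL);

    if (raw == REG_ALL_ONES) {
        return RING_INDEX_DEVICE_GONE;
    }
    if (raw >= ring_size) {
        return RING_INDEX_OUT_OF_RANGE;
    }
    *index_out = raw;
    return RING_INDEX_OK;
}
```

The range check is kept separate from the all-ones check so that a crashed firmware returning garbage is still caught even when the bus itself responds.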

Handling surprise removal correctly can be pretty tricky. Say you initiated I/O and gave the physical addresses of user buffers to the hardware for DMA, and now you can’t control the hardware because the BARs no longer seem to work. Do you cancel those I/Os, betting the hardware has forgotten about the addresses you gave it? If you cancel the requests from the software side, but the hardware was having some transient problem and comes back, you may now find the hardware wants to do DMA to user buffer pages that are no longer locked. Getting this right takes careful hardware and driver cooperation, and a correct implementation of reset semantics in hardware. I’ve seen hardware that does not respect the PCI Function reset, and does not respect the PCI command bits, so there was no way from software to assure the device was made safe.

I almost want to see complex hardware implement a driver watchdog reset, where it reverts to a guaranteed known idle state if it doesn’t hear from the driver for some time. On the other hand, hardware that changes state on its own, like when we freeze the system with the kernel debugger, is annoying too, as we might be debugging something else. You might need a global flag to enable or disable hardware device watchdogs, and each driver would need to disable the device watchdog if the global flag was set.
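
The watchdog idea could be modelled roughly as follows. This is only a sketch of the device-side logic in plain C, assuming a millisecond clock supplied by the caller; device_watchdog, watchdog_init, watchdog_heartbeat and watchdog_poll are hypothetical names, and the global_disable parameter stands in for the global "debugger attached" flag suggested above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device-side watchdog state. */
typedef struct {
    uint64_t last_heartbeat_ms; /* last time the driver checked in */
    uint64_t timeout_ms;        /* allowed silence before reverting to idle */
    bool     enabled;           /* false when the global flag disables watchdogs */
    bool     idle;              /* true once the device forced itself idle */
} device_watchdog;

static void
watchdog_init(device_watchdog *wd, uint64_t now_ms, uint64_t timeout_ms,
              bool global_disable)
{
    assert(wd != NULL);
    wd->last_heartbeat_ms = now_ms;
    wd->timeout_ms = timeout_ms;
    wd->enabled = !global_disable;
    wd->idle = false;
}

/* The driver calls this periodically to show it is still alive. */
static void
watchdog_heartbeat(device_watchdog *wd, uint64_t now_ms)
{
    assert(wd != NULL);
    wd->last_heartbeat_ms = now_ms;
}

/*
 * The device evaluates this periodically. Returns true once the device has
 * reverted to its known idle state. now_ms must not go backwards.
 */
static bool
watchdog_poll(device_watchdog *wd, uint64_t now_ms)
{
    assert(wd != NULL);
    if (wd->enabled && !wd->idle &&
        now_ms - wd->last_heartbeat_ms > wd->timeout_ms) {
        wd->idle = true; /* stop DMA, drop outstanding work, and so on */
    }
    return wd->idle;
}
```

With the flag set, a system frozen in the kernel debugger never trips the watchdog, which is the trade-off described above.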

Jan

If a PCI device falls off the bus in the middle of a transaction, the
chipset will issue a completion timeout, which in most cases translates
to an NMI.

If it’s a target transaction, there is absolutely nothing software can do, because
the CPU is stalled until the transaction returns…

SR is best effort.

Calvin

> To pass the WHQL tests you need to handle surprise removal.
> Devices on the motherboard suggest passing the WHQL tests for
> a whole system.

Absolutely. Drivers have to handle SR and rebalance to pass WHQL, so
implementing them isn’t optional. I never questioned that. Luckily KMDF
makes it easy (certainly compared to WDM).

My question was whether SR and rebalance were basically corner cases for a
specific set of devices. I can’t ignore corner cases, but I want to give them
an appropriate amount of attention. R&D resources are finite, even in large
companies. In this case, I think I’m OK assuming that passing WHQL is a
sufficient test. If I were working on a USB device, I’d probably be very
focused on SR, and might even develop my own stress tests for it.

> I know some developers see the hardware through rose colored glasses,
> and don’t feel you need to cope with malfunctioning hardware.

+1

My favorite questions during code reviews are “what happens if we get an
interrupt here?” and “what happens if the firmware dies here?”. Driver
developers need good answers to those questions for every line of code.

–John

Some of the worst failures I’ve seen are when the hardware doesn’t actually
work. UARTs on the old ISA bus frequently generated spurious interrupts
(which I had not been prepared for). I once worked on a system in which
the printer failed to interrupt after successfully printing a line (note
this printer was one of those massive things that connected to a
mainframe). We had to add a timeout, which in that OS was a nontrivial
problem, since it didn’t support concepts like timer DPCs.
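
Both failure modes, the interrupt that never arrives and the interrupt nobody asked for, can be covered by tracking each outstanding operation. The following is a rough sketch in plain C, not the code from that system; pending_op, op_start, op_interrupt and op_check are hypothetical names, and the caller is assumed to supply a millisecond clock (in a Windows driver the periodic check would typically run from a timer).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one outstanding device operation. */
typedef struct {
    bool     pending;     /* true while we wait for a completion interrupt */
    uint64_t started_ms;  /* when the operation was handed to the device */
    uint64_t timeout_ms;  /* how long we are willing to wait */
} pending_op;

typedef enum { OP_IDLE, OP_WAITING, OP_TIMED_OUT } op_state;

static void
op_init(pending_op *op)
{
    assert(op != NULL);
    op->pending = false;
    op->started_ms = 0;
    op->timeout_ms = 0;
}

static void
op_start(pending_op *op, uint64_t now_ms, uint64_t timeout_ms)
{
    assert(op != NULL);
    op->pending = true;
    op->started_ms = now_ms;
    op->timeout_ms = timeout_ms;
}

/*
 * Called from the interrupt path. Returns true if the interrupt completed
 * an outstanding operation, false if it was spurious (nothing pending).
 */
static bool
op_interrupt(pending_op *op)
{
    assert(op != NULL);
    if (!op->pending) {
        return false; /* spurious: ignore it instead of touching state */
    }
    op->pending = false;
    return true;
}

/* Called periodically. now_ms must not be earlier than started_ms. */
static op_state
op_check(const pending_op *op, uint64_t now_ms)
{
    assert(op != NULL);
    if (!op->pending) {
        return OP_IDLE;
    }
    return (now_ms - op->started_ms > op->timeout_ms) ? OP_TIMED_OUT
                                                      : OP_WAITING;
}
```

What to do on OP_TIMED_OUT (retry, reset the device, or report it failed) is a policy decision this sketch leaves open.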

So it is never safe to assume that hardware works. Or has an infinite
life and will work into the indefinite future. The problem is trying to
figure out what has gone wrong and how you can report it meaningfully (I
spent some months writing CPU diagnostics back in 1976).

I remember one device where we were having problems, and I determined just
by looking at the printer output that the high-order bit was stuck high on
the printer channel. And indeed it was; the prototype motherboards had
only seven of the necessary eight pull-up resistors on the printer
interface (standard old parallel printer port connector). And it wasn’t
that one had been omitted; the traces that would have been required did
not exist! The driver had been developed on a rev board that had all
eight pullups.

I think all the serious driver writers have similar horror stories; you
can’t help but hit these devices every couple years. I’ve never had to do
it, but I’ve been told stories of people who had to examine carefully the
traces of thousands of PCI bus cycles to show that no command had been
issued to the device, and therefore the misbehavior was in the device, not
the driver.
joe

> To pass the WHQL tests you need to handle surprise removal.
> Devices on the motherboard suggest passing the WHQL tests for
> a whole system.

Absolutely. Drivers have to handle SR and rebalance to pass WHQL, so
implementing them isn’t optional. I never questioned that. Luckily KMDF
makes it easy (certainly compared to WDM).

My question was whether SR and rebalance were basically corner cases for a
specific set of devices. I can’t ignore corner cases, but I want to give
them
an appropriate amount of attention. R&D resources are finite, even in
large
companies. In this case, I think I’m OK assuming that passing WHQL is a
sufficient test. If I were working on a USB device, I’d probably be very
focused on SR, and might even develop my own stress tests for it.

> I know some developers see the hardware through rose colored glasses,
> and don’t feel you need to cope with malfunctioning hardware.

+1

My favorite questions during code reviews are “what happens if we get an
interrupt here?” and “what happens if the firmware dies here?”. Driver
developers need good answers to those questions for every line of code.

–John

-------- Original Message --------
Subject: Re: [ntdev] On Supporting SURPRISE_REMOVAL and STOP in driver
From: Jan Bottorff
> To: Windows System Software Devs Interest List
> Date: 9/22/2012 8:13 PM
>
>> To pass the WHQL tests you need to handle surprise removal. Devices on
>> the motherboard suggest passing the WHQL tests for a whole system.
>>
>> I do understand the desire to avoid handling surprise removal, it can be
>> hard.
>>
>> I guess I’m one of those people who thinks drivers should try hard to
>> not crash the OS when hardware failures happen. I know some developers
>> see the hardware through rose colored glasses, and don’t feel you need
>> to cope with malfunctioning hardware. I’ve heard the “my driver never
>> needs to cope because…” reasoning before, and seen hardware do things
>> it’s not supposed to do.
>>
>> If your hardware is firmware controlled, which a LOT of modern hardware
>> is, then that firmware will occasionally crash, and your interface to
>> the device suddenly becomes undefined. For example, say your hardware
>> has a ring index register in its BAR window that you read and then use
>> as an index into an in-memory structure. You should notice if you read
>> all 0xFF and, if so, declare your hardware failed, not use all 0xFF as
>> the index into your ring. A hardware design that DMAs the ring index
>> into memory is safer, because if the hardware goes away you can still
>> access RAM, likely with its previous values. Writing to a BAR register
>> that is gone tends to be less dangerous. Hardware registers that have
>> all 0xFF as valid read data are really problematic. I know the PCIe 2.0
>> spec did not spell out what values would be read during a master/target
>> abort, although it did say it would be the same as the PCI spec, which
>> did say you get all 0xFFs. The motherboard chipsets that control the
>> root complex also usually did spell out reading 0xFFs on target/master
>> abort.
>>
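The all-0xFF check described above could be sketched roughly as follows. This is portable C for illustration, not real driver code: the 32-bit register width, the ring size, and the names `ring_dev` and `ring_next_index` are all assumptions. A real driver would read the BAR through its mapped register accessor and report the failure to PnP instead of just setting a flag.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 256u          /* hypothetical ring size */
#define REG_ALL_ONES 0xFFFFFFFFu   /* typical result of reading a device that is gone */

struct ring_dev {
    volatile const uint32_t *index_reg; /* mapped ring index register (assumed 32-bit) */
    bool failed;                        /* latched once the device is declared dead */
};

/* Returns true and stores a validated index in *index_out.
 * Returns false, leaving *index_out untouched, if the device looks dead. */
static bool ring_next_index(struct ring_dev *dev, uint32_t *index_out)
{
    if (dev->failed)
        return false;               /* once failed, stay failed until re-initialized */

    uint32_t raw = *dev->index_reg;

    /* All ones, or any value outside the ring, is never used as an index. */
    if (raw == REG_ALL_ONES || raw >= RING_ENTRIES) {
        dev->failed = true;
        return false;
    }

    *index_out = raw;
    return true;
}
```

The failure is latched on purpose: if the register starts returning plausible values again, the caller still has to go through a full reset before trusting the device.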
>> Handling surprise removal correctly can be pretty tricky. Say you
>> initiated I/O and gave the physical addresses of user buffers to the
>> hardware for DMA, and now you can’t control the hardware because the
>> BARs no longer seem to work. Do you cancel those I/Os, betting the
>> hardware has forgotten about the addresses you gave it? If you cancel
>> the requests from the software side, but the hardware was having some
>> transient problem and comes back, you may find the hardware wants to
>> do DMA to user buffer pages that are no longer locked. Getting
>> this right takes careful hardware and driver cooperation, and correct
>> implementation in hardware of reset semantics. I’ve seen hardware that
>> does not respect the PCI Function reset, and does not respect the PCI
>> command bits, so there was no way from software to assure the device was
>> made safe.
>>
>> I almost want to see complex hardware implement a driver watchdog
>> reset, where it reverts to a guaranteed known idle state if it doesn’t
>> hear from the driver for some time. On the other hand, hardware that
>> changes state on its own, like when we freeze the system with the kernel
>> debugger, is annoying too, as we might be debugging something else. You
>> might need a global flag to enable or disable hardware device watchdogs,
>> and each driver would need to disable the device watchdog if the global
>> flag was set.
>>
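A device-side watchdog with a disable switch, as floated above, might be modeled like this. It is a toy simulation in C. The state names, the millisecond tick, and the `watchdog_enabled` field (standing in for the per-device result of the global flag) are assumptions, not any existing hardware interface.

```c
#include <stdbool.h>
#include <stdint.h>

enum wd_state { WD_DEV_IDLE, WD_DEV_ACTIVE };

struct wd_device {
    enum wd_state state;
    uint32_t last_heartbeat_ms;  /* last time the driver checked in */
    uint32_t timeout_ms;         /* silence longer than this forces idle */
    bool watchdog_enabled;       /* cleared by the driver when debugging */
};

/* Called whenever the driver touches the device. */
static void wd_heartbeat(struct wd_device *d, uint32_t now_ms)
{
    d->last_heartbeat_ms = now_ms;
}

/* Called periodically by the (simulated) device firmware. */
static void wd_tick(struct wd_device *d, uint32_t now_ms)
{
    if (!d->watchdog_enabled || d->state == WD_DEV_IDLE)
        return;

    /* Unsigned subtraction stays correct across counter wraparound. */
    if ((uint32_t)(now_ms - d->last_heartbeat_ms) >= d->timeout_ms)
        d->state = WD_DEV_IDLE;  /* revert to the known idle state */
}
```

With `watchdog_enabled` cleared, a system frozen in the kernel debugger leaves the device state alone, which is the trade-off the paragraph describes.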
>> Jan
>>
>> -----Original Message-----
>> From: xxxxx@lists.osr.com
>> [mailto:xxxxx@lists.osr.com] On Behalf Of John McNamee
>> Sent: Saturday, September 22, 2012 4:32 PM
>> To: Windows System Software Devs Interest List
>> Subject: Re: [ntdev] On Supporting SURPRISE_REMOVAL and STOP in driver
>>
>> Doron,
>>
>> Sorry for any thread drift, but I have two related questions that I’ve
>> wondered about for a while…
>>
>> Is there a Real World scenario where a PCI device soldered down on the
>> system board will ever be surprise removed?
>>
>> Is there a Real World scenario where a PCI device soldered down on the
>> system board, and with fixed resource assignments (declared in ACPI and
>> set by BIOS during boot), will ever be stopped for rebalance?
>>
>> KMDF handles these situations (Thank You!), so it’s not a problem. I’m
>> just curious if these things ever really happen outside of WHCK.
>>
>> --John
>>
>>
>> -------- Original Message --------
>> Subject: Re: [ntdev] On Supporting SURPRISE_REMOVAL and STOP in driver
>> From: Doron Holan
>> To: Windows System Software Devs Interest List
>> Date: 9/22/2012 3:46 PM
>>
>>> It makes no sense to actively block stop. In a kmdf driver remove,
>>> stop, and power down are all the same code paths. So from that
>>> perspective, it is all tested and executed already
>>>
>>> d
>>>
>>> debt from my phone
>>> ----------------------------------------------------------------------
>>> --------
>>> From: xxxxx@gmail.com
>>> Sent: 9/22/2012 12:04 PM
>>> To: Windows System Software Devs Interest List
>>> Subject: RE:[ntdev] On Supporting SURPRISE_REMOVAL and STOP in driver
>>>
>>> Thank you, Jan and Doron.
>>>
>>> Doron: my bad, not QuerySurpriseRemove, just QueryRemove. Thank you
>>> for pointing that out. This is indeed a KMDF driver.
>>>
>>> I was actually trying to understand if there could ever be a rebalance
>>> of resources for a driver that does not actually manage any physical
>>> devices.
>>> What is the correct/recommended behaviour of a non-core wdf device
>>> driver when it receives the following: QUERY_STOP/STOP and
>>> SURPRISE_REMOVE irps ?
>>> Especially when there is no device to remove?
>>>
>>> Thank you and Best regards
>>> Sharma

If you don’t support query-remove, then the bus under which your device is
enumerated can’t be stopped. (It’s impossible to prove that you have no
dependencies on it.) And that means that the bus under which that bus is
enumerated can’t be stopped either. This chaining of failures, even when your
device has no resources, can often mean that hot-plugged devices don’t start.
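
The chaining can be pictured with a small model. This is a simplified sketch in C, not the PnP manager's actual algorithm: the `dev_node` structure and `subtree_accepts_query` are hypothetical names, and the only rule modeled is that a node can be stopped or removed only if it and every descendant succeed the query.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

struct dev_node {
    const char *name;
    bool accepts_query;                      /* does this driver succeed the query? */
    struct dev_node *children[MAX_CHILDREN]; /* devices enumerated under this one */
    size_t child_count;
};

/* A subtree can go down only if every driver in it agrees. */
static bool subtree_accepts_query(const struct dev_node *n)
{
    if (!n->accepts_query)
        return false;

    for (size_t i = 0; i < n->child_count; i++) {
        if (!subtree_accepts_query(n->children[i]))
            return false;   /* one veto blocks every ancestor */
    }
    return true;
}
```

In this model a single resource-less "soft" device that fails the query blocks its parent bus, and therefore that bus's parent, all the way up.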

Jake Oshins
Windows Kernel Team

This message offers no warranties and confers no rights.


There are a few cases that you might not think of as surprise removal, but
that are essentially equivalent to it, and they can happen to devices that are
soldered down to the motherboard. The most interesting one is the one that
Doron mentioned. If you respond to IRP_MN_QUERY_PNP_DEVICE_STATE with
PNP_DEVICE_FAILED (or if some filter, or a port driver like NDIS, responds
with that), then your driver will see surprise removal.

A similar case will occur if your driver doesn’t respond within the PCI
Express spec timings. We did a lot of work in Windows 8 to enable much
better runtime power management. As part of that, I completely rewrote the
power policy engine for PCI buses in PCI.sys, to allow for PCI Express ports
to be put in D3cold (or any other low-power state) while the devices
themselves are turned off. (This only occurs in S0 on new machines with
firmware that convinces us that the machine was designed for this scenario.) As
part of this rewrite, I had to add code to respect the bus timings. The
spec says that you need to wait 100ms after bringing up a link before you
touch configuration space of an endpoint. It also says that you have to
wait another 100ms after you write a D0 into the PMCSR before you can do
anything else to the device. Windows 7 played pretty loose with this stuff,
since it didn’t turn buses off at run time. So I added code to respect
these timings. And we’re finding that there are a few devices in the world
that don’t work unless you wait longer. The typical failure mode is that
the device fails to report the same Vendor/Device/SubsystemID set that it
reported before the bus was turned off. This will result in surprise
removal. A new bus scan will usually find your old device and you’ll see a
new add.
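
The failure mode described here can be illustrated with a small simulation. This is a sketch in C under stated assumptions, not PCI.sys logic: `fake_endpoint`, its `ready_after_ms` field, and `check_after_d0` are invented for the example, and the 100 ms figure is simply the one quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

#define D0_SETTLE_MS 100u  /* delay quoted in the post */

struct pci_identity {
    uint16_t vendor_id;
    uint16_t device_id;
    uint16_t subsystem_id;
};

enum resume_result { RESUME_OK, RESUME_TREAT_AS_SURPRISE_REMOVED };

/* Simulated endpoint: config reads return all ones until the device has
 * had ready_after_ms to come back up. */
struct fake_endpoint {
    uint32_t ready_after_ms;
    uint32_t elapsed_ms;
    struct pci_identity id;
};

static void fake_delay_ms(struct fake_endpoint *ep, uint32_t ms)
{
    ep->elapsed_ms += ms;  /* stand-in for a real timed wait */
}

static struct pci_identity fake_read_identity(const struct fake_endpoint *ep)
{
    struct pci_identity gone = { 0xFFFF, 0xFFFF, 0xFFFF };
    return (ep->elapsed_ms >= ep->ready_after_ms) ? ep->id : gone;
}

static bool same_identity(const struct pci_identity *a, const struct pci_identity *b)
{
    return a->vendor_id == b->vendor_id &&
           a->device_id == b->device_id &&
           a->subsystem_id == b->subsystem_id;
}

/* Wait the settle time, then compare what the device reports now with what
 * it reported before the bus was turned off. */
static enum resume_result check_after_d0(struct fake_endpoint *ep,
                                         const struct pci_identity *saved)
{
    fake_delay_ms(ep, D0_SETTLE_MS);
    struct pci_identity now = fake_read_identity(ep);
    return same_identity(&now, saved) ? RESUME_OK
                                      : RESUME_TREAT_AS_SURPRISE_REMOVED;
}
```

An endpoint with `ready_after_ms` of 100 passes the check, while one that needs 250 ms reports a mismatched identity and is treated as surprise-removed, even though a later rescan would find it again.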

Unfortunately, this same new code is used while bringing the machine out of
S3 and S4, as well as in S0 on new machines. So we’re seeing a few failures
even with existing devices.

Jake Oshins
Windows Kernel Team

This message offers no warranties and confers no rights.

“Doron Holan” wrote in message
news:xxxxx@ntdev…
Yes, s.r. can also happen due to software failure. If you report the state
as failed, you get a s.r. A filter can do this too. Not sure on the stop:
while Windows tries really hard to keep the BIOS-assigned hw settings, I
think they can be changed. Not sure either how a resource rebalance higher up
would affect this if a new device showed up upstream via hotplug. Jake
would know more definitively on both counts.

d

debt from my phone



I think I’d like to sit in a code review with you, John. I find that I can
often completely stymie somebody just by asking “what’s the ultimate result
if this function (picking almost any of them at random) returns failure?”
While you can often cause a two hour detour, the result is often a driver
with a very different, and better, architecture.

While Doron and I (and a bunch of other people) were working on building
KMDF 1.0, I remember a lot of discussions where we analyzed the KMDF
samples. We had a pile of drivers that we would update every time we
changed an interface. Eventually, we realized that one of our most
important measures of success while designing the interface was that the
samples had a really simple failure model. We tried to make it possible to
allocate most of what you really needed before I/O started flowing into your
driver. And we tried to make it possible to structure a driver where
teardown was the same regardless of whether it was happening in normal or
abnormal fashion.
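
The two goals in that paragraph, allocating up front and having one teardown path, can be shown in miniature. This is plain C for illustration, not KMDF code: `dev_ctx`, `dev_prepare`, `dev_teardown`, and the `quiesce_writes` counter are all hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

enum teardown_reason {
    TEARDOWN_REMOVE,
    TEARDOWN_STOP,
    TEARDOWN_SURPRISE_REMOVE,
    TEARDOWN_START_FAILED
};

struct dev_ctx {
    void *ring;          /* allocated before any I/O is accepted */
    void *dma_buf;
    bool hw_reachable;   /* may we still touch the hardware? */
    int quiesce_writes;  /* counts simulated register writes during teardown */
};

/* One teardown routine for every exit path. It is idempotent, and the only
 * thing the reason changes is whether the hardware may still be touched. */
static void dev_teardown(struct dev_ctx *c, enum teardown_reason why)
{
    if (c->hw_reachable && why != TEARDOWN_SURPRISE_REMOVE)
        c->quiesce_writes++;          /* stand-in for quiescing the device */
    c->hw_reachable = false;

    free(c->dma_buf);                 /* free(NULL) is a no-op */
    c->dma_buf = NULL;
    free(c->ring);
    c->ring = NULL;
}

/* Allocate everything the I/O path will need before I/O starts flowing. */
static bool dev_prepare(struct dev_ctx *c, size_t ring_bytes, size_t dma_bytes)
{
    c->ring = malloc(ring_bytes);
    c->dma_buf = malloc(dma_bytes);
    if (c->ring == NULL || c->dma_buf == NULL) {
        dev_teardown(c, TEARDOWN_START_FAILED);  /* same path as normal exit */
        return false;
    }
    c->hw_reachable = true;
    return true;
}
```

Because a failed start, a stop, a remove, and a surprise remove all funnel into `dev_teardown`, the abnormal paths get exercised every time the normal ones do.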

We weren’t entirely successful. There are always things that I’d do better
in hindsight. And there are always places that you have to compromise. But
I think the object model in KMDF facilitates this pattern for the most part.

Jake Oshins
Windows Kernel Team

This message offers no warranties and confers no rights.


Jake,
Thank you for this update, it’s very helpful. Could you clarify one
point here, please?

On 23-Sep-2012 07:46, Jake Oshins wrote:
[snip]

> A similar case will occur if your driver doesn’t respond within the PCI
> Express spec timings.

The driver, or hardware/microcode/whatever?

> The spec says that you need to wait 100ms after
> bringing up a link before you touch configuration space of an endpoint.
> It also says that you have to wait another 100ms after you write a D0
> into the PMCSR before you can do anything else to the device. Windows 7
> played pretty loose with this stuff, since it didn’t turn buses off at
> run time. So I added code to respect these timings.

So, are StartDevice and PnP D0 IRPs sent to the driver before or after this
100 ms delay in Win8? Is the driver responsible for waiting 100 ms before
accessing the hardware?

Thanks,
– pa

On 23-Sep-2012 03:13, Jan Bottorff wrote:

> I almost want to see complex hardware be noticing a driver watchdog reset,
> like it reverts to a guaranteed known idle state if it’s doesn’t hear from
> the driver for some time.

A very good idea. It was known as the “child lost in a mall” pattern
in one place where I’ve worked (meaning, a child is told to
go straight to the info desk if lost).

– pa

Hmmmm… This thread started with a discussion about STOP processing. STOP has been historically reserved for resource rebalancing on the PCI bus.

Are there cases in which you can get a STOP on buses other than PCI in Win8? I believe I heard of such a case on SoC-type systems, but the description wasn’t clear and I could be entirely mistaken…

Peter
OSR