I think I’d like to sit in a code review with you, John. I find that I can
often completely stymie somebody just by asking “what’s the ultimate result
if this function (picking almost any of them at random) returns failure.”
While that can cause a two-hour detour, the result is often a driver
with a very different, and better, architecture.
While Doron and I (and a bunch of other people) were working on building
KMDF 1.0, I remember a lot of discussions where we analyzed the KMDF
samples. We had a pile of drivers that we would update every time we
changed an interface. Eventually, we realized that one of our most
important measures of success while designing the interface was that the
samples had a really simple failure model. We tried to make it possible to
allocate most of what you really needed before I/O started flowing into your
driver. And we tried to make it possible to structure a driver where
teardown was the same regardless of whether it was happening in normal or
abnormal fashion.
We weren’t entirely successful. There are always things that I’d do better
in hindsight. And there are always places that you have to compromise. But
I think the object model in KMDF facilitates this pattern for the most part.
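The failure model described above can be sketched in plain C. Everything the device needs is acquired before I/O can arrive. A failed setup and a normal removal then go through one teardown routine, which tolerates partial initialization. This is a minimal user-mode sketch, not KMDF code. `device_ctx`, `device_setup`, and `device_teardown` are hypothetical names, and `malloc` stands in for whatever the driver actually allocates. In a real KMDF driver, parenting objects to the WDFDEVICE lets the framework do much of this cleanup when the parent is deleted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-device context; field names are illustrative only. */
typedef struct {
    void *ring;     /* e.g. a DMA descriptor ring */
    void *scratch;  /* e.g. a preallocated request buffer */
    bool started;   /* true once I/O may flow */
} device_ctx;

/* Single teardown path, safe on a fully or partially initialized context. */
static void device_teardown(device_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->started = false;   /* stop accepting I/O first */
    free(ctx->scratch);     /* free(NULL) is a no-op */
    ctx->scratch = NULL;
    free(ctx->ring);
    ctx->ring = NULL;
}

/* Acquire everything before I/O can arrive; on failure, reuse the same teardown. */
static bool device_setup(device_ctx *ctx, size_t ring_bytes, size_t scratch_bytes)
{
    assert(ctx != NULL);
    ctx->ring = NULL;
    ctx->scratch = NULL;
    ctx->started = false;

    ctx->ring = malloc(ring_bytes);
    if (ctx->ring == NULL)
        goto fail;
    ctx->scratch = malloc(scratch_bytes);
    if (ctx->scratch == NULL)
        goto fail;

    ctx->started = true;    /* only now would I/O be allowed in */
    return true;

fail:
    device_teardown(ctx);   /* the same code a normal removal runs */
    return false;
}
```

`device_teardown` resets each pointer after freeing it, so calling it twice is harmless. That property is what lets the error path and the removal path share it.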
Jake Oshins
Windows Kernel Team
This message offers no warranties and confers no rights.
“John McNamee” wrote in message news:xxxxx@ntdev…
> To pass the WHQL tests you need to handle surprise removal.
> Devices on the motherboard suggest passing the WHQL tests for
> a whole system.
Absolutely. Drivers have to handle SR and rebalance to pass WHQL, so
implementing them isn’t optional. I never questioned that. Luckily KMDF
makes it easy (certainly compared to WDM).
My question was whether SR and rebalance were basically corner cases for a
specific set of devices. I can't ignore corner cases, but I want to give them
an appropriate amount of attention. R&D resources are finite, even in large
companies. In this case, I think I'm OK assuming that passing WHQL is a
sufficient test. If I were working on a USB device, I'd probably be very
focused on SR, and might even develop my own stress tests for it.
> I know some developers see the hardware through rose-colored glasses,
> and don't feel you need to cope with malfunctioning hardware.
+1
My favorite questions during code reviews are “what happens if we get an
interrupt here?” and “what happens if the firmware dies here?”. Driver
developers need good answers to those questions for every line of code.
–John
-------- Original Message --------
Subject: Re: [ntdev] On Supporting SURPRISE_REMOVAL and STOP in driver
From: Jan Bottorff
To: Windows System Software Devs Interest List
Date: 9/22/2012 8:13 PM
> To pass the WHQL tests you need to handle surprise removal. Devices on the
> motherboard suggest passing the WHQL tests for a whole system.
>
> I do understand the desire to avoid handling surprise removal, it can be
> hard.
>
> I guess I’m one of those people who thinks drivers should try hard to not
> crash the OS when hardware failures happen. I know some developers see the
> hardware through rose colored glasses, and don’t feel you need to cope
> with malfunctioning hardware. I’ve heard the “my driver never needs to
> cope because…” reasoning before, and seen hardware do things it’s not
> supposed to do.
>
> If your hardware is firmware controlled, which a LOT of modern hardware
> is, then that firmware will occasionally crash, and your interface to the
> device suddenly becomes undefined. For example, say your hardware has a
> ring index register in its BAR window that you read and then use as an
> index into an in-memory structure. You should notice if you read all
> 0xFFs and, if so, declare your hardware failed rather than use all 0xFFs
> as the index into your ring. A hardware design that DMAs the ring index
> into memory is safer, because if the hardware goes away you can still
> access RAM, likely with its previous values. Writing to BAR registers that
> are gone tends to be less dangerous. Hardware registers that have all 0xFFs
> as valid read data are really problematic. I know the PCIe 2.0 spec did
> not spell out what values would be read during a master/target abort,
> although it did say the behavior would be the same as the PCI spec, which
> did say you get all 0xFFs. The motherboard chipsets that control the root
> complex also usually did spell out reading 0xFFs on a target/master abort.
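The check Jan describes can sit as a small validation step between the register read and the ring access. This sketch makes three assumptions:

- The index register is 32 bits wide.
- The ring is a hypothetical 256-entry ring.
- The register never legitimately reads as all ones. As Jan notes, registers where it can are the problematic case.

In a real driver the raw value would come from a routine such as `READ_REGISTER_ULONG`. What to do on failure is up to the driver; in KMDF that might mean calling `WdfDeviceSetFailed`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_ENTRIES 256u             /* hypothetical ring size */
#define ALL_ONES_READ 0xFFFFFFFFu     /* typical result of reading a device that is gone */

typedef enum {
    RING_INDEX_OK,
    RING_INDEX_DEVICE_GONE,
    RING_INDEX_OUT_OF_RANGE
} ring_index_status;

/*
 * Check a raw ring index read from a device register before using it to
 * index host memory. *index_out is written only when the result is OK.
 */
static ring_index_status validate_ring_index(uint32_t raw, uint32_t *index_out)
{
    assert(index_out != NULL);
    if (raw == ALL_ONES_READ)
        return RING_INDEX_DEVICE_GONE;     /* treat the device as failed */
    if (raw >= RING_ENTRIES)
        return RING_INDEX_OUT_OF_RANGE;    /* firmware returned garbage */
    *index_out = raw;
    return RING_INDEX_OK;
}
```

For a small ring, the range check alone would also reject 0xFFFFFFFF. Keeping the all-ones case separate lets the driver tell "device gone" apart from "firmware confused", which may call for different recovery.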
>
> Handling surprise removal correctly can be pretty tricky. Say you
> initiated I/O, gave the physical addresses of user buffers to the
> hardware for DMA, and now you can't control the hardware because the BARs
> no longer seem to work. Do you cancel those I/Os, betting the hardware has
> forgotten about the addresses you gave it? If you cancel the requests from
> the software side, but the hardware was having some transient problem and
> comes back, you may now find the hardware wants to DMA to user buffer
> pages that are no longer locked. Getting this right takes careful
> hardware and driver cooperation, and a correct hardware implementation of
> reset semantics. I've seen hardware that does not respect the PCI Function
> reset and does not respect the PCI command bits, so there was no way from
> software to ensure the device was made safe.
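Given Jan's caveats, one thing software can still try is to clear Bus Master Enable (bit 2 of the PCI Command register) and read it back before deciding what to do with outstanding DMA. The sketch below uses a simulated device in place of real configuration-space access; `sim_pci_device` and its fields are hypothetical. It classifies two results:

- A read-back of 0xFFFF is treated as "device gone".
- A cleared bit only shows that the register latched the write.

Hardware that ignores the command bits can still DMA, so none of these outcomes proves the device is safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_BUS_MASTER 0x0004u  /* Bus Master Enable, bit 2 of the Command register */
#define CONFIG_READ_ALL_ONES   0xFFFFu  /* typical config read from an absent device */

/* Simulated device standing in for real config-space access (hypothetical). */
typedef struct {
    bool present;        /* false models a surprise-removed or dead device */
    bool write_ignored;  /* true models a Command register that does not latch writes */
    uint16_t command;    /* simulated PCI Command register */
} sim_pci_device;

typedef enum {
    QUIESCE_REGISTER_CLEARED,  /* bit reads back clear; DMA is NOT proven stopped */
    QUIESCE_DEVICE_GONE,       /* reads return all ones */
    QUIESCE_NOT_LATCHED        /* bit still set after the write */
} quiesce_result;

static uint16_t sim_read_command(const sim_pci_device *dev)
{
    return dev->present ? dev->command : (uint16_t)CONFIG_READ_ALL_ONES;
}

static void sim_write_command(sim_pci_device *dev, uint16_t value)
{
    if (dev->present && !dev->write_ignored)
        dev->command = value;
}

/* Clear Bus Master Enable, then read it back to classify what happened. */
static quiesce_result try_disable_bus_mastering(sim_pci_device *dev)
{
    assert(dev != NULL);
    uint16_t cmd = sim_read_command(dev);
    if (cmd == CONFIG_READ_ALL_ONES)
        return QUIESCE_DEVICE_GONE;
    sim_write_command(dev, (uint16_t)(cmd & ~PCI_COMMAND_BUS_MASTER));
    cmd = sim_read_command(dev);
    if (cmd == CONFIG_READ_ALL_ONES)
        return QUIESCE_DEVICE_GONE;
    return (cmd & PCI_COMMAND_BUS_MASTER) ? QUIESCE_NOT_LATCHED : QUIESCE_REGISTER_CLEARED;
}
```

The "device gone" result is the ambiguous case Jan describes. The device may come back from a transient failure, so unlocking the DMA buffers at that point is a bet, not a certainty.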
>
> I almost want complex hardware to implement a driver watchdog, so that it
> reverts to a guaranteed known idle state if it doesn't hear from the
> driver for some time. On the other hand, hardware that changes state on
> its own, like when we freeze the system with the kernel debugger, is
> annoying too, as we might be debugging something else. You might need a
> global flag to enable or disable hardware device watchdogs, and each
> driver would need to disable its device watchdog if the global flag was
> set.
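Jan's watchdog idea can be modeled in a few lines. The driver programs a timeout and pings the device periodically, and the device reverts to idle if the pings stop. A global flag (for example, one set while a kernel debugger is attached) keeps the driver from arming the watchdog at all. Everything here is hypothetical and simulated; no real device interface or Windows API is implied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global policy flag, e.g. set while a kernel debugger is attached. */
static bool g_device_watchdogs_disabled = false;

/* Simulated device-side watchdog state (hypothetical). */
typedef struct {
    uint32_t timeout_ticks;     /* silent ticks tolerated before reverting to idle */
    uint32_t ticks_since_ping;
    bool enabled;               /* armed by the driver */
    bool idle;                  /* device has reverted to its known idle state */
} sim_watchdog;

/* Driver side: arm the watchdog unless the global flag says not to. */
static void driver_arm_watchdog(sim_watchdog *wd, uint32_t timeout_ticks)
{
    assert(wd != NULL && timeout_ticks > 0);
    wd->timeout_ticks = timeout_ticks;
    wd->ticks_since_ping = 0;
    wd->idle = false;
    wd->enabled = !g_device_watchdogs_disabled;
}

/* Driver side: periodic "still alive" ping. */
static void driver_ping(sim_watchdog *wd)
{
    assert(wd != NULL);
    wd->ticks_since_ping = 0;
}

/* Device side: one tick of the device's own timer. */
static void device_tick(sim_watchdog *wd)
{
    assert(wd != NULL);
    if (!wd->enabled || wd->idle)
        return;
    if (++wd->ticks_since_ping >= wd->timeout_ticks)
        wd->idle = true;   /* in hardware: stop DMA, drop descriptors, return to reset state */
}
```

The flag is checked when the watchdog is armed rather than on every tick. That matches Jan's suggestion that each driver disables its own device watchdog; the device itself never needs to know about the debugger.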
>
> Jan
>
> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com] On Behalf Of John McNamee
> Sent: Saturday, September 22, 2012 4:32 PM
> To: Windows System Software Devs Interest List
> Subject: Re: [ntdev] On Supporting SURPRISE_REMOVAL and STOP in driver
>
> Doron,
>
> Sorry for any thread drift, but I have two related questions that I’ve
> wondered about for a while…
>
> Is there a Real World scenario where a PCI device soldered down on the
> system board will ever be surprise removed?
>
> Is there a Real World scenario where a PCI device soldered down on the
> system board, and with fixed resource assignments (declared in ACPI and
> set by BIOS during boot), will ever be stopped for rebalance?
>
> KMDF handles these situations (Thank You!), so it’s not a problem. I’m
> just curious if these things ever really happen outside of WHCK.
>
> --John
>
>
> -------- Original Message --------
> Subject: Re: [ntdev] On Supporting SURPRISE_REMOVAL and STOP in driver
> From: Doron Holan
> To: Windows System Software Devs Interest List
> Date: 9/22/2012 3:46 PM
>
>> It makes no sense to actively block stop. In a KMDF driver, remove,
>> stop, and power down all go through the same code paths. So from that
>> perspective, it is all tested and executed already.
>>
>> d
>>
>> Sent from my phone
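Doron's point that remove, stop, and power down share code paths can be modeled as a dispatcher in which every teardown event funnels into the same two routines. In KMDF the corresponding driver callbacks are:

- `EvtDeviceD0Exit`, for leaving the working power state.
- `EvtDeviceReleaseHardware`, for undoing `EvtDevicePrepareHardware`.

The sketch below is a simplified user-mode model with hypothetical names. It ignores details such as the target power state passed to `EvtDeviceD0Exit` and the separate `EvtDeviceSurpriseRemoval` notification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified teardown events; these are not KMDF types. */
typedef enum {
    EVT_POWER_DOWN,          /* Sx or idle power-down: leave D0 only */
    EVT_STOP_FOR_REBALANCE,  /* resources are about to change */
    EVT_REMOVE,
    EVT_SURPRISE_REMOVE
} teardown_event;

typedef struct {
    bool hw_mapped;      /* resources mapped, as in a PrepareHardware step */
    bool in_d0;          /* device powered, interrupts and DMA running */
    int d0_exit_calls;
    int release_calls;
} model_device;

/* Stands in for EvtDeviceD0Exit: quiesce the device; must cope with hardware that is gone. */
static void model_d0_exit(model_device *dev)
{
    if (dev->in_d0) {
        dev->in_d0 = false;
        dev->d0_exit_calls++;
    }
}

/* Stands in for EvtDeviceReleaseHardware: undo what PrepareHardware set up. */
static void model_release_hardware(model_device *dev)
{
    if (dev->hw_mapped) {
        dev->hw_mapped = false;
        dev->release_calls++;
    }
}

/* Every event uses the same routines; only how far teardown goes differs. */
static void model_teardown(model_device *dev, teardown_event evt)
{
    assert(dev != NULL);
    model_d0_exit(dev);
    if (evt != EVT_POWER_DOWN)
        model_release_hardware(dev);
}
```

In this model, a driver that never sees a rebalance still exercises the stop routines on every removal, and the power-down routine on every sleep. That is why actively blocking stop buys little.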
>> ------------------------------------------------------------------------------
>> From: xxxxx@gmail.com
>> Sent: 9/22/2012 12:04 PM
>> To: Windows System Software Devs Interest List
>> Subject: RE:[ntdev] On Supporting SURPRISE_REMOVAL and STOP in driver
>>
>> Thank you, Jan and Doron.
>>
>> Doron: my bad, not QuerySurpriseRemove, just QueryRemove. Thank you
>> for pointing that out. This is indeed a KMDF driver.
>>
>> I was actually trying to understand if there could ever be a rebalance
>> of resources for a driver that does not actually manage any physical
>> devices.
>> What is the correct/recommended behaviour of a non-core WDF device
>> driver when it receives the following: QUERY_STOP/STOP and
>> SURPRISE_REMOVE IRPs?
>> Especially when there is no device to remove?
>>
>> Thank you and Best regards
>> Sharma