The case of UMDF driver sudden death: the resolution

Michal_Vodicka-2 · August 3, 2009, 2:15pm

Last week I posted few messages about my UMDF driver suddenly
disappearing during the DTM machine test. Now I know where the problem
lies. There is nothing wrong with the driver, it is just clash between
DTM and UMDF. And UMDF is the loser.

It occurs during DTM Common I/O stress test. The test repeatedly (among
others) disables and re-enables all devices in the system. Including our
one which causes driver restart. During start UMDF kernel part receives
PnP IRP IRP_MN_START_DEVICE. In turn, it sends a message to the driver
and framework calls OnPrepareHardware() callback. There we configure USB
and prepare the device for use. Normally, it takes few hundreds ms. But
under the test, it is much longer. The whole machine is so overloaded
that mouse cursor is usually frozen. It slows down our driver in 3
orders of magnitude. And guess what. The message sent to driver timeouts
and UMDF driver manager kills the driver…

The timeout is by default 60 sec. Normally it is more than enough but
not in this case. I extended the timeout to 1 hour and left the test
running for whole weekend. The driver was restarted more than 300 times
and average OnPrepareHardware() callback processing time was about 260
sec (!). The worst time was 360 sec i.e 6 times longer than the default
timeout.

I’m going to file a bug report. This test is run by all OEMs and an UMDF
driver doesn’t have a big chance to pass it. With default configuration,
we’re going to release driver with timeout extended as short-term
solution. The way how to do it is of course undocumented. However, I
believe it is a problem for all UMDF drivers and should be solved inside
framework. In other cases it extends timeout when it seems driver makes
some progress. Not in this case. In addition, it seems worker threads
calling callbacks run on default priority. Raising could help but it
should be done inside framework, too.

The worst part was diagnostic. In such a case, when driver is
intentionally murdered, I’d expect some reasonable message in the Event
Log about it. There was only generic one suggesting that driver crashed.
There are silent WER reports when one knows where to find them and how
to configure WER to report at least something (was disabled at DTM
machine). WER created short descriptions and failed when copying WDF
traces and creating dumps. It was probably because of the stress tests
running. It was able to copy traces in about 10% of cases and in just
one case it created dumps. In other cases it just wrote about dump
creating failure. In many cases it simply didn’t do anything. There were
about 10x more messages in the event log than WER reports.

Actually, even small WER report file contained the useful info. Which
can be only deciphered when one knows where the problem is and/or have
to access to UMDF sources. For the future reference (only important
fields):

Sig[1].Name=Problem
Sig[1].Value=HostTimeout

It means host didn’t process a message within the timeout and was
killed. In this case, maybe there are others.

Sig[4].Name=ExitCode
Sig[4].Value=103

It is STATUS_PENDING as we all know. The status of the IRP in the time
of timeout. Quite misleading as an exit code.

Sig[6].Name=Message
Sig[6].Value=11b00

The most interesting part. It is OR of:

1 << 16 IRP related message
0x1b << 8 IRP_MJ_PNP
0x0 IRP_MN_START_DEVICE

There were also messages 11602 when it was power IRP related message
which was found in the queue as the first timeouted.

It is beyond my understanding how are driver developers expected to
decode it. Maybe it is documented somewhere but Google can’t find it.

Looking back I made one mistake. I thought the host timeout problem is
the consequence of driver crash. So I didn’t try to decipher what WER
report fields really mean and tried to find why driver crashes, instead.
Finally I found it and discovered that the timeout was the reason and
the crash/termination the consequence. Who with kernel experience would
expect it, killing driver because of a timeout seems as a really strange
idea… If it is documented somewhere, I’d like to know.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com http:</http:>]

Maxim_S_Shatskih · August 3, 2009, 4:34pm

>guess what. The message sent to driver timeouts and UMDF driver manager kills the driver…

Without an event log warning or error message?

This is the worst case in all of this. I would say this is a bug in wudfhost.

Or is the timeout handling is in your code and not UMDFs? then why you do not log a trace when the timeout fires? for me, it is more or less obvious that each timeout event is a good candidate for logging.

–
Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Michal_Vodicka-2 · August 3, 2009, 4:45pm

> -----Original Message-----

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Maxim
S. Shatskih
Sent: Monday, August 03, 2009 10:34 PM
To: Windows System Software Devs Interest List
Subject: Re:[ntdev] The case of UMDF driver sudden death: the
resolution

>guess what. The message sent to driver timeouts and UMDF
driver manager kills the driver…

Without an event log warning or error message?

There was a generic message about driver crash in system event log. This
is the same for any crash.

There were following traces in ETW WUDF log which indicated the problem:

WdfLpcCommPort::WudfSearchTimeoutMessagesWithLockHeld: lpcMsg has been
cancelled 0xFFFFF98005682E20
WdfLpcCommPort::ProcessMessage: LpcMsg
(FFFFF88002FC8B68,FFFFF9800CC5CE20) was cancelled
lpc handle FFFFF98021004FF0 received LPC_CANCELED msg

I’d like to know who’d read the last one as “we’re going to kill your
driver”.

This is the worst case in all of this. I would say this is a
bug in wudfhost.

At least insufficient error reporting.

Or is the timeout handling is in your code and not UMDFs?
then why you do not log a trace when the timeout fires? for
me, it is more or less obvious that each timeout event is a
good candidate for logging.

Surely no. It was UMDF driver manager. I had no idea it kills drivers
which don’t process request within a timeout and had no idea about
timeout value. Yes, it was apparent from my traces processing is much
slower than usual. But I didn’t make a connection because the idea of
killing drivers didn’t come into mind.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]

OSR_Community_User · August 3, 2009, 11:08pm

Thanks for your feedback Michal. We’d check the trace messages and plan for improvements.

http://www.microsoft.com/whdc/driver/wdf/UMDF-arch.mspx

has a UMDF architecture doc that talks about UMDF adaptive timeouts. We plan to integrate such info in WDK docs.

Restarting drivers and timeouts are new approaches with UMDF model (although timeout is not entirely new) so you are right that they are not encountered (in the same form) with km drivers.

With UMDF, drivers are in user mode which means they are not trusted code, so frameworks attempts to protect system from any misbehaved driver*. Hence when critical operations are blocked by the driver (pnp, power, I/O cancellation, cleanup, close) framework enforces a timeout. This timeout is currently 1 minute and is adaptive in some cases - i.e. if framework detects progress it extends the timeout.

Driver restart attempts to recover from errors (currently 5 times). It is true that in some cases it would just cause the same error to happen 5 times, but for intermittent crashes it provides recoverability to the user without having to restart the machine.

Both these mechanisms won’t work perfectly in all cases, and are likely to be fine tuned with experience.

The following section describes debugging in timeout cases

http://msdn.microsoft.com/en-us/library/bb743209.aspx

Thanks,
Praveen

* I realize that in your case the cause was stress and not misbehavior of your driver.

This post implies no warranties and confers no rights.

Chris_Aseltine · August 3, 2009, 11:29pm

Michal Vodicka wrote:

Looking back I made one mistake.

Using UMDF?

Michal_Vodicka-2 · August 4, 2009, 11:47am

Praveen,

comments inside:

Thanks for your feedback Michal. We’d check the trace
messages and plan for improvements.

That would be great. Please consider also improving event log messages.
This case is important enough to deserve separate message.

http://www.microsoft.com/whdc/driver/wdf/UMDF-arch.mspx

has a UMDF architecture doc that talks about UMDF adaptive
timeouts. We plan to integrate such info in WDK docs.

I read this paper long time ago. Maybe some previous version, I’m not
sure if timeout part was there, I don’t remember it. Still, it is a bit
vague and I’m not sure if it’d help me to find the reason why my driver
crashed.

Please document WER report content. It contains all necessary info but
we need to know what the fields mean.

Restarting drivers and timeouts are new approaches with UMDF
model (although timeout is not entirely new) so you are right
that they are not encountered (in the same form) with km drivers.

With UMDF, drivers are in user mode which means they are not
trusted code, so frameworks attempts to protect system from
any misbehaved driver*. Hence when critical operations are
blocked by the driver (pnp, power, I/O cancellation, cleanup,
close) framework enforces a timeout. This timeout is
currently 1 minute and is adaptive in some cases - i.e. if
framework detects progress it extends the timeout.

That’s not bad but it is new and unexpected and should be emphasized for
kernel mode devs. And reported clearly.

However, there is a real problem and I reported it as a Win7 bug. The
timeout is not sufficient in DTM test scenario and every OEM uses this
test. I even believe almost no UMDF driver would pass it. From my traces
it seems as even basic USB configuration (configuring pipes) took more
than 1 minute in most cases.

Please test it internally and solve it somewhat. Make timeout adaptive,
raise threads priority when processing important requests. It is not
trusted code but still it is a driver and should have precedence over
apps during stress.

I could raise thread priorities myself but it doesn’t seem as a good
idea. Worker threads are owned by UMDF and I have no control which one
is used for which request.

Driver restart attempts to recover from errors (currently 5
times). It is true that in some cases it would just cause the
same error to happen 5 times, but for intermittent crashes it
provides recoverability to the user without having to restart
the machine.

Thinking more about it, it seems as evil feature. For sure in test
scenarios because masks some problems as in our case. I will instruct
our QA to set it to zero as I want to know about any driver crash. For
DTM the same, it should set it to 0 automatically. At user’s machines it
is questionable. Not so bad if WER really works and I as a developer can
receive all the info about crashes. Unfortunately, it doesn’t. As I
wrote, it was really unreliable, in most cases there was no report at
all. Also, I’d like to know how and when these silent reports are sent
to WinQual. I don’t see any for our driver there although we generated a
lot of reports during tests and machines have Internet connection. How
it is expected to work?

Both these mechanisms won’t work perfectly in all cases, and
are likely to be fine tuned with experience.

When and by whom? We just set timeout to 550 sec. Using undocumented
registry value. 550 sec because it is less than 600 sec which is timeout
for power IRPs. We had one BSOD because of this when timeout was longer.
I guess it is better to have terminated UMDF driver than BSOD.

The following section describes debugging in timeout cases

http://msdn.microsoft.com/en-us/library/bb743209.aspx

Yes, but you have to know the host process was terminated by reflector.
Which I didn’t know. I used approaches described there but it didn’t
help too much. I even used kernel debugger and HostFailKdDebugBreak and
catched the crash. It still didn’t help. I could see IRP_MN_START_DEVICE
queued but it wasn’t anything new, I already saw it from my traces
driver disappears during this IRP processing. I just didn’t know why it
disappeared.

Traces were more helpful but is should be improved, too. There is one
file with limited size (hardcoded 1MB) and it is easy to loose necessary
traces. What about keeping several last files (10, configurable) per
driver run? I know traces are copied to WER report which would be enough
if it works but it doesn’t in most cases. You should also document
traces flags and level. I had to set the highest bit (0 by default) to
see function calls. Traces are in TMF files (which is good) so related
flags should be documented. Also, why to hardcode 1MB? It should be
configurable, with fuction calls info it is easy to overflow it.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]

Michal_Vodicka-2 · August 4, 2009, 11:49am

> -----Original Message-----

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@gmail.com
Sent: Tuesday, August 04, 2009 5:30 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] The case of UMDF driver sudden death: the
resolution

Michal Vodicka wrote:

> Looking back I made one mistake.

Using UMDF?

No, it wasn’t *my* mistake. We were forced to use it.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]

Chris_Aseltine · August 4, 2009, 11:52am

Michal Vodicka wrote:

No, it wasn’t *my* mistake. We were forced to use it.

Ahh … well, after reading your epilogue, that’s certainly how I would feel. Same goes for these people who are (re)discovering that you can’t upgrade from a beta WDF coinstaller to the RTM one. Surprise!

Michal_Vodicka-2 · August 4, 2009, 12:10pm

> -----Original Message-----

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@gmail.com
Sent: Tuesday, August 04, 2009 5:53 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] The case of UMDF driver sudden death: the
resolution

Ahh … well, after reading your epilogue, that’s certainly
how I would feel.

Child illnesses, this technology is still young. Somebody has to test it
(but why me? ;-)) and from questions in this list it seems as only few
people use it.

I can understand UMDF team devs who live with this project don’t see how
complicated it is from outside and how much necessary info is missing.
That’s why I wrote it. Hopefully it will have some effect.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]

Michal_Vodicka-2 · August 4, 2009, 12:12pm

Oh, I forgot. What is really missing are framework sources. It was said
many times so once more.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Michal Vodicka
Sent: Tuesday, August 04, 2009 6:10 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] The case of UMDF driver sudden death:
the resolution

> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com] On Behalf Of
> xxxxx@gmail.com
> Sent: Tuesday, August 04, 2009 5:53 PM
> To: Windows System Software Devs Interest List
> Subject: RE:[ntdev] The case of UMDF driver sudden death: the
> resolution
>
> Ahh … well, after reading your epilogue, that’s certainly
> how I would feel.

Child illnesses, this technology is still young. Somebody has
to test it
(but why me? ;-)) and from questions in this list it seems as only few
people use it.

I can understand UMDF team devs who live with this project
don’t see how
complicated it is from outside and how much necessary info is missing.
That’s why I wrote it. Hopefully it will have some effect.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online
at http://www.osronline.com/page.cfm?name=ListServer

Ilya_Faenson · August 5, 2009, 7:54am

Although UMDF may bear partial responsibility for this failure, I believe the considerable fault here lies with the Common I/O Stress Test itself.

We happen to have a similar enough issue where a user mode component is involved in starting a device of ours. A service switches the device from the boot into the functional mode (boot mode looks like a HID to support the BIOS operation). That service does not get a chance to run for minutes under this test on slower laptops about to be shipped now by major OEMs. A couple of minutes is typical. 100% CPU usage is reported for virtually the whole duration.

The primary (in my opinion) issue here is that the test does not adjust its operation depending on the kind of machine it runs on. Believe it or not, some laptops still ship with a lone single-threaded CPU. I am not even talking netbooks… I think the test should adjust its load depending on a machine it encounters. It should certainly be preventing 100% CPU load for the whole duration of the device restart.

Just my $.02,

Ilya Faenson
Rockville, MD USA

Michal_Vodicka-2 · August 5, 2009, 12:17pm

I think that maximal CPU utilization is the purpose of the test, not the
fault. We have similar results at rather old single core machine and at
the newest multi-core laptop prototype. In both cases I can see similar
times of start IRP processing in my UMDF driver. Both much longer than
default WUDF host timeout.

I believe the problem is at UMDF side. The test is here for a long time
and kernel drivers (which don’t depend on user mode part as in your
case) have no problem with the test. If MS decided to push UMDF, they
should ensure user mode drivers behave similarly and are able to pass
this test. If it means running their threads at real time priority,
that’s life. If even this doesn’t help, other solution needs to be made.
At UMDF side, it is inherent framework problem, not driver problem.

Just curious: have you tried to raise your service thread priorities? Or
do you mean service wasn’t even started for the long time? If so, it is
a bit different problem than UMDF one as everything necessary was
running there, just slowly.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@hotmail.com
Sent: Wednesday, August 05, 2009 1:56 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] The case of UMDF driver sudden death: the
resolution

Although UMDF may bear partial responsibility for this
failure, I believe the considerable fault here lies with the
Common I/O Stress Test itself.

We happen to have a similar enough issue where a user mode
component is involved in starting a device of ours. A
service switches the device from the boot into the functional
mode (boot mode looks like a HID to support the BIOS
operation). That service does not get a chance to run for
minutes under this test on slower laptops about to be shipped
now by major OEMs. A couple of minutes is typical. 100% CPU
usage is reported for virtually the whole duration.

The primary (in my opinion) issue here is that the test does
not adjust its operation depending on the kind of machine it
runs on. Believe it or not, some laptops still ship with a
lone single-threaded CPU. I am not even talking netbooks…
I think the test should adjust its load depending on a
machine it encounters. It should certainly be preventing
100% CPU load for the whole duration of the device restart.

Just my $.02,

Ilya Faenson
Rockville, MD USA

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online
at http://www.osronline.com/page.cfm?name=ListServer

Ilya_Faenson · August 5, 2009, 2:35pm

Michal,

We have been observing statistically significant improved start times on better hardware so our experiences differ somewhat in that regard. Note however that it may be partially due to different roles user mode code plays for each of us. UMDF service helps start a device which is being started already. On the other hand, no device start proceeds for us until our service does its device creation magic. I do agree that UMDF needs to be improved too though, no doubt.

However, the test happens to enforce a device start within 60 seconds regardless of how busy the computer is. The test can’t then take all the CPU in this time frame in my opinion. It should leave some CPU for normal priority threads to run. Yeah, we’ve been discussing raising the service thread priorities but we feel it’s the wrong approach to take. Much of the required functionality is within Microsoft threads rather than the service proper anyway and we have little control over them. Might still try it.

Answering your question, the service does get started but runs very slowly. You and I are dealing with basically the same problem therefore although we’re somewhat luckier as the service is under under control. Not much of a consolation now though as OEMs are very reluctant to let us change anything that close to the Win7 launch time since everything does work functionally. Nobody but this test cares if the device start takes two minutes rather than one when the CPU utilization is 100%.

Best,

Ilya Faenson
Rockville, MD USA

Michal_Vodicka-2 · August 5, 2009, 2:46pm

Ilya,

small correction: it is not the test but UMDF driver manager which
enforces 60 sec timeout. AFAIK, the test itself doesn’t care about the
time driver starts. But UMDF kills the driver host process if a PnP IRP
isn’t processed within 60 sec. So I believe the problem is purely at
UMDF side.

Otherwise it really seems similar, I guess I’ll try to play with
priorities when have time and DTM machine available.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@hotmail.com
Sent: Wednesday, August 05, 2009 8:37 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] The case of UMDF driver sudden death: the
resolution

Michal,

We have been observing statistically significant improved
start times on better hardware so our experiences differ
somewhat in that regard. Note however that it may be
partially due to different roles user mode code plays for
each of us. UMDF service helps start a device which is being
started already. On the other hand, no device start proceeds
for us until our service does its device creation magic. I
do agree that UMDF needs to be improved too though, no doubt.

However, the test happens to enforce a device start within 60
seconds regardless of how busy the computer is. The test
can’t then take all the CPU in this time frame in my opinion.
It should leave some CPU for normal priority threads to run.
Yeah, we’ve been discussing raising the service thread
priorities but we feel it’s the wrong approach to take. Much
of the required functionality is within Microsoft threads
rather than the service proper anyway and we have little
control over them. Might still try it.

Answering your question, the service does get started but
runs very slowly. You and I are dealing with basically the
same problem therefore although we’re somewhat luckier as the
service is under under control. Not much of a consolation
now though as OEMs are very reluctant to let us change
anything that close to the Win7 launch time since everything
does work functionally. Nobody but this test cares if the
device start takes two minutes rather than one when the CPU
utilization is 100%.

Best,

Ilya Faenson
Rockville, MD USA

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online
at http://www.osronline.com/page.cfm?name=ListServer

Ilya_Faenson · August 5, 2009, 3:13pm

Michal,

DTM Sleep Stress with IO and alike also fail with “missing device” error in 60 seconds (our devices generally lose power and get removed in suspend and hibernation scenarios) so a minute seems to be a popular default timeout at Microsoft.

Thanks for the info,

Ilya Faenson
Rockville, MD USA

Peter_Wieland · August 5, 2009, 3:29pm

Which test is failing you because you fail to startup in 60 seconds? Not what’s killing your driver, but what test reports failure?

-p

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Michal Vodicka
Sent: Wednesday, August 05, 2009 11:46 AM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] The case of UMDF driver sudden death: the resolution

Ilya,

small correction: it is not the test but UMDF driver manager which enforces 60 sec timeout. AFAIK, the test itself doesn’t care about the time driver starts. But UMDF kills the driver host process if a PnP IRP isn’t processed within 60 sec. So I believe the problem is purely at UMDF side.

Otherwise it really seems similar, I guess I’ll try to play with priorities when have time and DTM machine available.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@hotmail.com
Sent: Wednesday, August 05, 2009 8:37 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] The case of UMDF driver sudden death: the
resolution

Michal,

We have been observing statistically significant improved start times
on better hardware so our experiences differ somewhat in that regard.
Note however that it may be partially due to different roles user mode
code plays for each of us. UMDF service helps start a device which is
being started already. On the other hand, no device start proceeds
for us until our service does its device creation magic. I do agree
that UMDF needs to be improved too though, no doubt.

However, the test happens to enforce a device start within 60 seconds
regardless of how busy the computer is. The test can’t then take all
the CPU in this time frame in my opinion.
It should leave some CPU for normal priority threads to run.
Yeah, we’ve been discussing raising the service thread priorities but
we feel it’s the wrong approach to take. Much of the required
functionality is within Microsoft threads rather than the service
proper anyway and we have little control over them. Might still try
it.

Answering your question, the service does get started but runs very
slowly. You and I are dealing with basically the same problem
therefore although we’re somewhat luckier as the service is under
under control. Not much of a consolation now though as OEMs are very
reluctant to let us change anything that close to the Win7 launch time
since everything does work functionally. Nobody but this test cares
if the device start takes two minutes rather than one when the CPU
utilization is 100%.

Best,

Ilya Faenson
Rockville, MD USA

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

OSR_Community_User · August 5, 2009, 3:46pm

Michal,

We just set timeout to 550 sec. Using undocumented
registry value.

Do you do this only on the DTM machine? Similarly regarding thread priority - do you modify it only on the test system?

You shouldn’t have to (and shouldn’t) modify shipping code/inf to make the test pass - much less so with an undocumented regkey. We should find some other way for that.

Regarding driver restart, yes during testing the restart count should be set to 0 as you suggested. We are in the process of documenting it per your feedback.

Tuning the timeouts and restart policy will be upon UMDF team. Similarly your other feedback is well taken and is valuable. Although the standard disclaimer of not being able to promise any specific changes or timelines applies.

Thanks,
Praveen

ps: The links I mentioned were more for info/reference and not to imply that you should have looked at them :-).

This post implies no warranties and confers no rights.

Michal_Vodicka-2 · August 5, 2009, 3:53pm

System - Common Scenario Stress With IO. It repeatedly disables and
enables all devices in system including ours and when driver is killed
because of timeout, the test fails. DTM log contains following:

B(3).3 - The device’s Problem Codes (if any) and its Status should not
change throughout this test. {ADCBD1E8-16E8-11DB-B50B-0010C65AFEB5}

And error: Problem Code has changed to: 10 (which is problem starting
device).

One OEM also claimed that sometimes also fails System - Disable/Enable
test but we weren’t able to reproduce it.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Peter Wieland
Sent: Wednesday, August 05, 2009 9:23 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] The case of UMDF driver sudden death:
the resolution

Which test is failing you because you fail to startup in 60
seconds? Not what’s killing your driver, but what test
reports failure?

-p

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Michal Vodicka
Sent: Wednesday, August 05, 2009 11:46 AM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] The case of UMDF driver sudden death:
the resolution

Ilya,

small correction: it is not the test but UMDF driver manager
which enforces 60 sec timeout. AFAIK, the test itself doesn’t
care about the time driver starts. But UMDF kills the driver
host process if a PnP IRP isn’t processed within 60 sec. So I
believe the problem is purely at UMDF side.

Otherwise it really seems similar, I guess I’ll try to play
with priorities when have time and DTM machine available.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]

> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com] On Behalf Of
> xxxxx@hotmail.com
> Sent: Wednesday, August 05, 2009 8:37 PM
> To: Windows System Software Devs Interest List
> Subject: RE:[ntdev] The case of UMDF driver sudden death: the
> resolution
>
> Michal,
>
> We have been observing statistically significant improved
start times
> on better hardware so our experiences differ somewhat in
that regard.
> Note however that it may be partially due to different
roles user mode
> code plays for each of us. UMDF service helps start a
device which is
> being started already. On the other hand, no device start proceeds
> for us until our service does its device creation magic. I do agree
> that UMDF needs to be improved too though, no doubt.
>
> However, the test happens to enforce a device start within
60 seconds
> regardless of how busy the computer is. The test can’t
then take all
> the CPU in this time frame in my opinion.
> It should leave some CPU for normal priority threads to run.
> Yeah, we’ve been discussing raising the service thread
priorities but
> we feel it’s the wrong approach to take. Much of the required
> functionality is within Microsoft threads rather than the service
> proper anyway and we have little control over them. Might still try
> it.
>
> Answering your question, the service does get started but runs very
> slowly. You and I are dealing with basically the same problem
> therefore although we’re somewhat luckier as the service is under
> under control. Not much of a consolation now though as
OEMs are very
> reluctant to let us change anything that close to the Win7
launch time
> since everything does work functionally. Nobody but this test cares
> if the device start takes two minutes rather than one when the CPU
> utilization is 100%.
>
> Best,
>
> Ilya Faenson
> Rockville, MD USA
>
>
> —
> NTDEV is sponsored by OSR
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer
>

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online
at http://www.osronline.com/page.cfm?name=ListServer

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online
at http://www.osronline.com/page.cfm?name=ListServer

Michal_Vodicka-2 · August 5, 2009, 4:11pm

Praveen,

> We just set timeout to 550 sec. Using undocumented
registry value.

Do you do this only on the DTM machine? Similarly regarding
thread priority - do you modify it only on the test system?

You shouldn’t have to (and shouldn’t) modify shipping
code/inf to make the test pass - much less so with an
undocumented regkey. We should find some other way for that.

Yes, I modified timeout for shipping INF. What else could we do? Our
customers are OEMs who install our signed driver at their machines and
run DTM System tests there. With the default timeout our driver is
marked as the reason of the test failure.

I filed Win7 bug for this (ID 480584) and there is no response, yet.
They wanted immediate solution and we already signed driver with changed
timeout for them. We can’t ask them to modify timeout at DTM machines.
INF change would break our signature and they wouldn’t accept manual
registry change. They wouldn’t accept any registry change, they want to
run test at the machine as it is going to production.

I see no other way as code for Win7 is already closed and we can’t
expect the fix in timely (i.e. days) manner, I guess.

I didn’t play with thread priorities, yet. For us it was better to make
a fix with INF change only. We’re also in pre-production phase.

Regarding driver restart, yes during testing the restart
count should be set to 0 as you suggested. We are in the
process of documenting it per your feedback.

Thanks.

Tuning the timeouts and restart policy will be upon UMDF
team. Similarly your other feedback is well taken and is
valuable. Although the standard disclaimer of not being able
to promise any specific changes or timelines applies.

Thanks, sounds good to me. I believe extending timeout is sufficient
short term solution, do you see any harm it can cause? I set it
intentionally to less than 10 minutes to avoid BSOD because of power IRP
timeout in kernel.

ps: The links I mentioned were more for info/reference and
not to imply that you should have looked at them :-).

No problem

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]

OSR_Community_User · August 5, 2009, 9:29pm

> What else could we do?

You can probably get an exception for this failure based on the bug that you filed.

Btw I didn’t mean that you should change thread priorities. Both changing INF and thread priorities should be avoided.

We are trying to understand why you ran into such an issue but the internal UMDF drivers haven’t. The owner of this test tells me that I/O stress is completed before enable/disable of devices. Only thing which might be going on in parallel to your device start is post-request-completion activity of other driver(s) - I don’t know if it would be heavy enough to bog down the CPU for 1 minute.

Would you know what is pegging the CPU during the start of your device?

Thanks,
Praveen

ps: Michal, would you mind if we take this investigation offline and post a summary to ntdev later?