Root cause of intermittent restarts on Windows XP Embedded.

Hi All,

I’m working on an appliance-based product which runs Windows XPE (SP1). The
OS never got updated to the latest XPE for historical reasons, and I don’t
have control over that.

The XPE image we have is very lightweight and customized to include only the
required Windows components. We have our own NAT driver, which we hook in
through DNE.

The appliance has a watchdog which restarts the system if the application
hangs or crashes. This is done to avoid manual intervention and reduce
downtime.

One of our customers is experiencing intermittent restarts of the system,
and from our understanding the application neither crashed nor hung (we know
this for sure). We have no other clues as to what is causing the reboots.

A few things that came to mind are:
1. Bad RAM
2. Bad hardware components
3. BSOD
4. Power spikes
5. Buggy driver

We are currently working on a debug build that can give us more clues about
these mystery restarts. We are enabling the windump option in the ‘system
recovery’ component, adding debug statements to our NAT driver, disabling
the watchdog so that the system stays hung when something goes wrong instead
of restarting, etc.

However, I wanted to ask the experts how to go about finding the root cause
of this mystery issue; any advice would be really appreciated. Are there any
other reasons for Windows to restart automatically? Can something go wrong
at the network layer that might cause the OS to reboot?

BTW I’m new to Windows system programming and debugging, and thought there
could be more to it than what I suspect.

I can provide more details if you need,
Thanks,
-Phani.

Well, it sounds like you’ve already done most of what I’d suggest.

  • Make sure that you tell Windows to create a FULL MEMORY DUMP on
    crash

  • Disable the ‘autoreboot on crash’ feature

  • If a crash dump ends up getting generated, I’d establish a kd
    session if you have access to the system, assuming that you know something
    about the kernel

  • Check the event log. It’s generally a waste of time, but it’s
    always a good idea to check
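
For the first two bullets, here is a minimal sketch of the registry settings involved, written as a small Python script that builds a .reg file. The CrashControl key and the CrashDumpEnabled and AutoReboot values are the standard Windows ones; the helper name build_reg_file is just made up for illustration, and I have not tried this on an XPE SP1 image, so treat it as a starting point.

```python
# Sketch: build .reg text that requests a complete memory dump and
# disables the automatic reboot after a bugcheck.
# build_reg_file is a hypothetical helper, not part of any Windows tooling.

CRASH_CONTROL_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl"
)


def build_reg_file(key, dword_values):
    """Return .reg text that sets the given REG_DWORD values under key."""
    lines = ["Windows Registry Editor Version 5.00", "", "[%s]" % key]
    for name, value in dword_values:
        lines.append('"%s"=dword:%08x' % (name, value))
    return "\r\n".join(lines) + "\r\n"


crash_control_reg = build_reg_file(
    CRASH_CONTROL_KEY,
    [
        ("CrashDumpEnabled", 1),  # 1 = complete memory dump
        ("AutoReboot", 0),        # 0 = stay on the blue screen after a crash
    ],
)

if __name__ == "__main__":
    print(crash_control_reg)
```

Save the output as a .reg file and import it (for example with regedit /s), then reboot. Keep in mind that a complete dump needs a page file on the boot volume that is somewhat larger than physical RAM, which a stripped-down XPE image may not have configured.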

Good luck,

mm

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Phaniraj Raghavendra
Sent: Thursday, August 12, 2010 5:05 AM
To: Kernel Debugging Interest List
Subject: [windbg] Root cause of intermittent restarts on Windows XP
Embedded.


To debug that system, I’d do something about your watchdog timer, which I assume is hardware-based.

The best solution, which we’ve used in our system for 10+ years, is to have the watchdog raise an NMI instead of resetting the system right away, wait 10 minutes or so for the memory.dmp to be written, and only then reset the system. By getting a dump, you’re then able to debug any hang issues, such as the CPU getting stuck in a driver loop. Yes, it slows the recovery process, but you’re preventing future incidents, since you can then find and fix the problem.
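
To make the sequence concrete, here is a rough sketch of that watchdog policy in Python. It is only a model of the logic, not our firmware: raise_nmi, dump_finished and reset_system are hypothetical hooks standing in for whatever the watchdog hardware provides, and the 600-second default simply mirrors the "10 minutes or so" above.

```python
import time


def handle_watchdog_expiry(raise_nmi, dump_finished, reset_system,
                           dump_timeout_s=600.0, poll_interval_s=5.0,
                           clock=time.monotonic, sleep=time.sleep):
    """On watchdog expiry: NMI first, wait for the dump, then reset.

    All three callables are hypothetical hardware hooks.
    Returns True if dump_finished() reported success before the reset.
    """
    raise_nmi()  # ask the OS to bugcheck and write memory.dmp
    deadline = clock() + dump_timeout_s
    completed = False
    while clock() < deadline:
        if dump_finished():  # optional early exit; may always be False
            completed = True
            break
        sleep(poll_interval_s)
    reset_system()  # always reset in the end, dump or no dump
    return completed
```

If the hardware has no way to tell that the dump is done, pass a dump_finished that always returns False and the function degrades to a fixed wait. On the Windows side, as far as I know the NMICrashDump value under the CrashControl key has to be set to 1 for an NMI to produce a dump, but verify that on your XPE image.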

But you may not be able to do that at this point, so I’d recommend that you enable the keyboard-driven crash (http://msdn.microsoft.com/en-us/library/ff545499(VS.85).aspx) and turn off your watchdog. Then have the customer do the keyboard sequence when the problem occurs. That will then give you a dump that you can examine and figure out the issue.
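
As a sketch of what enabling the keyboard-driven crash involves, the snippet below builds the .reg text for the documented CrashOnCtrlScroll value. This assumes a PS/2 keyboard handled by the i8042prt driver; the variable name is just for illustration.

```python
# Sketch: .reg text that enables the keyboard-initiated crash for PS/2
# keyboards (i8042prt driver). KEYBOARD_CRASH_REG is an illustrative name.

KEYBOARD_CRASH_REG = "\r\n".join([
    "Windows Registry Editor Version 5.00",
    "",
    r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"
    r"\i8042prt\Parameters]",
    '"CrashOnCtrlScroll"=dword:00000001',  # 1 = enable the key sequence
    "",
])

if __name__ == "__main__":
    print(KEYBOARD_CRASH_REG)
```

After importing this and rebooting, holding the right Ctrl key and pressing Scroll Lock twice should bugcheck the machine with MANUALLY_INITIATED_CRASH (0xE2). If the appliance only has a USB keyboard, this value may not help on XP, so check that before sending instructions to the customer.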