Any tools available to debug a system hang (other than Driver Verifier)?

seetha_rama_sarma_Jayanthy · September 8, 2014, 1:14pm

Hi,

When I run tests on my driver, the system freezes at a certain point. I can’t get any hints about the cause because there is no crash dump.

Are there any tools available to debug a system freeze? I tried Driver Verifier (verifier.exe) with all the checks enabled, but the system still doesn’t generate a crash dump.

Seetharama Sarma Jayanthy

Tim_Roberts · September 8, 2014, 1:30pm

xxxxx@gmail.com wrote:

> You are right… Mine is a PCI-e based device. I am using legacy interrupts.

Is this a serial adapter that you have created? One common way to cause
this kind of freeze is an interrupt storm, where your hardware generates
an interrupt that your ISR does not clear. As soon as your ISR returns,
the interrupt is immediately raised again. Are you handling the
possibility that your hardware fires multiple interrupts before your ISR
can run? Are you making sure to check ALL interrupt sources, every time
the ISR runs?

> When I run tests on my driver, the system freezes at a certain point. I can’t get any hints about the cause because there is no crash dump.
>
> Are there any tools available to debug a system freeze? I tried Driver Verifier (verifier.exe) with all the checks enabled, but the system still doesn’t generate a crash dump.

Are you running with a kernel debugger? If you really have an interrupt
storm, the debugger won’t be able to get in, either, but it’s worth trying.

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Peter_Viscarola_OSR · September 8, 2014, 1:57pm

Have you tried forcing a crash from the keyboard?
http://msdn.microsoft.com/en-us/library/windows/hardware/ff545499(v=vs.85).aspx

Peter
OSR
@OSRDrivers

Jan_Bottorff · September 8, 2014, 5:50pm

On some systems you can also force an NMI (non-maskable interrupt), which you can configure to cause a crash dump. There are a variety of ways to generate the NMI.

I know Dell servers I used to work on had a way to generate NMIs from the GUI remote admin interface.

Some motherboards (mostly servers?) have headers you can connect a switch to that cause an NMI. Some systems can also generate an NMI on watchdog expiration. There is sometimes a configuration option to select between reset and NMI on watchdog expiration.

For PCI bus computers, there used to be a little add-in board you could get that forced an NMI. Not sure if this is available for PCIe though due to the more complex signalling.

You might also add support in your driver to trigger a crash dump when some criterion is met, for example your ISR being called 100K times without some flag getting set by other code in your driver. You can also run the full checked OS, which asserts if the interrupt rate is too high (it assumes it’s an interrupt storm). One strategy I once used long ago was to create a worker thread which raised its IRQL to HIGH_LEVEL (or something similar) and spun, polling some flags I put in the driver to detect progress (the lack of which was assumed to be a hang), and caused a crash dump if certain criteria were met. Note that essentially taking over one core tended to cause its own side effects, so spinning at HIGH_LEVEL and periodically dropping back to PASSIVE_LEVEL helped a bit.
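A minimal sketch of the counter idea above, written as plain user-mode C so it compiles anywhere. The names (progress_watch, watch_on_isr, watch_note_progress) and the exact threshold handling are illustrative assumptions, not code from any real driver. In a real driver the ISR would do something like call KeBugCheckEx when the check fires, and the counter would need protection against concurrent access (for example, by only touching it under the interrupt spin lock).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: ISR invocations allowed without any sign of progress. */
#define STORM_THRESHOLD 100000u

struct progress_watch {
    uint32_t isr_calls_since_progress;
};

/* Called by non-ISR code (e.g. a DPC) whenever real work completes. */
void watch_note_progress(struct progress_watch *w)
{
    w->isr_calls_since_progress = 0;
}

/*
 * Called at the top of the ISR. Returns true once the ISR has run
 * STORM_THRESHOLD times with no progress reported in between; the
 * caller would then force a crash dump.
 */
bool watch_on_isr(struct progress_watch *w)
{
    if (w->isr_calls_since_progress < STORM_THRESHOLD)
        w->isr_calls_since_progress++;   /* saturate instead of wrapping */
    return w->isr_calls_since_progress >= STORM_THRESHOLD;
}
```

The counter saturates rather than wraps, so once the threshold is reached the check keeps firing until some other code path reports progress.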

I suppose if you are a hardware vendor, for programmable devices you might be able to add some way to generate an NMI to the device firmware.

There are some kinds of hangs that you can’t break into; on PCI buses, for example, a device used to be able to cause an infinite read retry. A hang caused by something like incorrect power management might be pretty hard to debug: if power to memory is incorrectly turned off, for example, the processor starts executing garbage, including at the NMI vector.

If you have really deep pockets I believe Intel has hardware trace facilities on some processors. I see Arium has merged with another company http://www.asset-intertech.com/Products/SourcePoint/SourcePoint-for-Intel

Jan

> —
> NTDEV is sponsored by OSR
>
> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
>
> OSR is HIRING!! See http://www.osr.com/careers
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Peter_Viscarola_OSR · September 8, 2014, 6:23pm

Don’t even START down that path…

If somebody would like to buy our Arium, I’ll be more than happy to trade you for a reasonable denomination Arch Card.

Peter
OSR
@OSRDrivers

mm1 · September 8, 2014, 7:13pm

+1

[It’s probably been four years since I’ve had to suffer through using an
Arium, so this information is all old.]

While an Arium is your only real option for a few things, it is still
absolutely the option of last resort. Getting one to work with the average
board these days is a very low probability event.

Unless you have the appropriate level of NDA with Intel (not sure how it
works with AMD), you’re going to have a hell of a time getting a suitable
CRB, in my experience. They will not help you and what you will be able to
get ahold of will likely be limited and old.

I believe that Arium finally stopped advertising the interposer interface.
For a while they kicked the problem out the door to a company that sold
interposers (for $2K - $K) that Arium finally admitted (privately at least)
didn’t really work.

You might be able to find a board normally used for embedded work that has a
socket, but it is likely to be very specific and probably kind of limited.

Otherwise, you’ll need a board that you happen to know has active traces on
it and somebody who knows what they are doing to attach an XDP socket to it.
Off the top of my head, I want to say that an XDP socket has 60 pins, but
whatever the exact pin count, it’s dense enough that your home soldering
iron is not really an option and there’s going to be some resistors
involved, IIRC.

Any or all of these problems may have been fixed, I don’t know, but even on
the best of days, on a board that comes with an XDP socket already mounted,
the connection between the Arium and the tap is dicey and SourcePoint is
very limited.

mm


seetha_rama_sarma_Jayanthy · September 9, 2014, 2:09pm

Hi,

I tried manually crashing the system, but when the system freezes even the keyboard becomes unresponsive, so I couldn’t generate a crash dump.

Also, I believe that an interrupt storm is not happening, because this issue reproduces only when the UART port settings are changed to generate interrupts frequently.

The system freeze occurs when we run the test simultaneously on 3 ports. The test runs fine on a single port and frequently passes on two ports. When we change the serial port configuration to generate interrupts less frequently, the issue is seen very rarely. Below is some elaboration on this.

There are two registers in our serial port that operate as FIFOs and buffer the content that is received or to be transmitted. The Tx FIFO register buffers the content to be transmitted and the Rx FIFO register buffers the content that is received. These FIFOs are 128 bytes each.
When the UART receives data into the Rx FIFO and the FIFO fills to a certain level (16 bytes, 96 bytes, etc.), the UART generates an interrupt to our driver. (This level is called the Rx trigger level.) In this interrupt we copy the data from the FIFO to our driver. This interrupt is called the Rx Data interrupt.
The same applies to the Tx FIFO register. When the data in the Tx FIFO drops below a certain level (fewer than 16 bytes, fewer than 50 bytes, etc.), the UART generates another interrupt called the Tx Empty interrupt. (This level is called the Tx trigger level.) In this interrupt we write data into the Tx FIFO.
These UARTs can be programmed with our choice of Rx and Tx trigger levels. When the Rx trigger level is set to a high value, the UART buffers more data in the Rx FIFO and hence generates fewer interrupts. When the Tx trigger level is set to a high value, the Tx FIFO drains less between interrupts, so the UART generates more interrupts.

With the above in mind, I would like to narrow down our problem. We see the system freeze only when we set a low Rx trigger value and a high Tx trigger value. To be precise, when we set the Rx trigger to 16 and the Tx trigger to 112, we run into this problem consistently when we run the test with 3 ports. (It works fine with one port, though.)

However, when we set the Tx trigger value to 16 and the Rx trigger value to 96, we do not see this issue. Hence it is evident that there is an issue when more interrupts are being generated. It looks like our driver is unable to cope with the rate at which interrupts are generated. However, CPU and memory usage at the time of the freeze were pretty normal: CPU usage was around 10% and memory usage was about 590 MB out of a total of 4 GB.

Another observation: when we let the test application print its output to the console, we see this issue less frequently, i.e. the test runs smoothly on more ports. When we redirect the output of the test application to a text file, the issue reproduces fairly quickly.

While the output is printed to the console, CPU utilization is high (with the test running on one port, it goes up to 30%). When we redirect the output to a log file, CPU utilization is about 1%. This makes us think that in the former case the application issues read and write commands at a slower pace, whereas in the latter case it issues them at a quicker rate.

Which areas of my driver should I be looking at to determine the root cause of this issue?

Tim_Roberts · September 10, 2014, 1:11pm

xxxxx@gmail.com wrote:

> Also, I believe that an interrupt storm is not happening, because this issue reproduces only when the UART port settings are changed to generate interrupts frequently.

I find that statement astounding. If you KNOW the issue changes based
on interrupt frequency, then how can you possibly deny that the problem
is interrupt related?

I really see two possibilities: one, an interrupt storm (an interrupt
that remains unacknowledged); two, a protocol violation on the bus.

In general, it is quite difficult for a driver by itself to cause a
freeze that also locks out keyboard and mouse. You are almost certainly
triggering some kind of hardware issue.

> There are two registers in our serial port that operate as FIFOs and buffer the content that is received or to be transmitted. The Tx FIFO register buffers the content to be transmitted and the Rx FIFO register buffers the content that is received. These FIFOs are 128 bytes each.

Do all of your serial ports share a single interrupt? In your interrupt
handler, do you check for all potential interrupt sources before
returning? That is, if you happen to find that “port 1 Rx FIFO full”
fired, do you handle it and then immediately return, or do you continue
to check and clear the other interrupt sources? Is your interrupt
shared with other devices? Are you handling that possibility?

> However, when we set the Tx trigger value to 16 and the Rx trigger value to 96, we do not see this issue. Hence it is evident that there is an issue when more interrupts are being generated. It looks like our driver is unable to cope with the rate at which interrupts are generated.

What is the data rate? How many interrupts per second are you
handling? And, by the way, you really should know that number right off
the top of your head.

> While the output is printed to the console, CPU utilization is high (with the test running on one port, it goes up to 30%). When we redirect the output to a log file, CPU utilization is about 1%. This makes us think that in the former case the application issues read and write commands at a slower pace, whereas in the latter case it issues them at a quicker rate.

Right. Writing to and scrolling the console window is a time-consuming
operation. You are probably getting many fewer interrupts.

> Which areas of my driver should I be looking at to determine the root cause of this issue?

You should be looking at your ISR and your DPC, and you should be
renting a PCI Express bus analyzer.

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

NtDev_Geek · September 10, 2014, 2:42pm

You may use AdPlus.exe in your driver kit. Follow the documentation properly.
Hope it’ll help.

./D

seetha_rama_sarma_Jayanthy · September 11, 2014, 10:39am

>> Also, I believe that an interrupt storm is not happening, because this issue
>> reproduces only when the UART port settings are changed to generate interrupts
>> frequently.
>
> I find that statement astounding. If you KNOW the issue changes based
> on interrupt frequency, then how can you possibly deny that the problem
> is interrupt related?

The problem may be related to interrupts, but it seems to me that not clearing a particular interrupt condition is a logic bug in the code, and hence it should reproduce every time that particular interrupt is generated. However, the issue reproduces only randomly. (A very few times I have even seen it not reproduce at all.)

I will assume for a moment that with a faster rate of interrupts there is some sort of data error occurring on the hardware side, that the hardware raises a particular interrupt to signal that condition, and that perhaps I am not handling this condition. However:

1. When an interrupt storm really happened (I did run into this problem during the early stages of development and subsequently fixed it), the system slowed down for a while and eventually crashed. More importantly, while the system was slowed down the keyboard and mouse were still responding.

2. The hardware has a register which provides the status of the interrupt on the hardware. Reading this register clears the interrupt condition on the hardware (regardless of which interrupt fired). I read this register at the beginning of my ISR, so the interrupt condition on the hardware must be getting cleared.

> Do all of your serial ports share a single interrupt? In your interrupt
> handler, do you check for all potential interrupt sources before
> returning? That is, if you happen to find that “port 1 Rx FIFO full”
> fired, do you handle it and then immediately return, or do you continue
> to check and clear the other interrupt sources? Is your interrupt
> shared with other devices? Are you handling that possibility?

8 of my serial ports share a single interrupt because all of them are present on a single controller. After handling one particular interrupt in the ISR (like “port 1 Rx FIFO full” as given above), I still go on to check the other interrupts.

My interrupt is shared between devices. As soon as I receive an interrupt, I check one of my hardware registers to see whether there is an interrupt condition on the hardware. If yes, I handle all of those conditions and return TRUE from the ISR. If not, I return FALSE from my ISR straight away. I hope this is what is meant by handling an interrupt that is shared between devices.
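The shared-interrupt handling described above can be sketched roughly as follows. This is a user-mode model, not the actual driver: fake_card, read_and_clear_status, service_port and the MAX_PASSES bound are all hypothetical names and choices made for illustration. It models a read-to-clear status register and re-reads it until it comes back empty, so a source that fires while another port is being serviced is still picked up before the ISR returns.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PORTS  8
#define MAX_PASSES 4   /* bound the loop so the ISR cannot spin forever */

/* Hypothetical stand-in for the card; a real driver reads hardware registers. */
struct fake_card {
    uint32_t pending;        /* bit n set = port n has an interrupt pending */
    uint32_t arrives_later;  /* bits that become pending during servicing */
    unsigned serviced[NUM_PORTS];
};

/* Models a read-to-clear status register: returns the pending bits and clears them. */
static uint32_t read_and_clear_status(struct fake_card *card)
{
    uint32_t status = card->pending;
    card->pending = 0;
    return status;
}

static void service_port(struct fake_card *card, unsigned port)
{
    card->serviced[port]++;
    /* Simulate new interrupts being raised while this one is handled. */
    card->pending |= card->arrives_later;
    card->arrives_later = 0;
}

/* Returns true if the interrupt was ours, false so the OS can try other sharers. */
bool fake_isr(struct fake_card *card)
{
    bool claimed = false;
    for (int pass = 0; pass < MAX_PASSES; ++pass) {
        uint32_t status = read_and_clear_status(card);
        if (status == 0)
            break;                      /* nothing (more) pending on this device */
        claimed = true;
        for (unsigned port = 0; port < NUM_PORTS; ++port)
            if (status & (1u << port))
                service_port(card, port);
    }
    return claimed;
}
```

Because the status read clears the hardware condition, every bit seen in one read has to be serviced from that one snapshot; a handler that returns after servicing only the first set bit would silently lose the others.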

> What is the data rate? How many interrupts per second are you
> handling? And, by the way, you really should know that number right off
> the top of your head

The data rate is 38400 bps. I will check the number of interrupts per second using some performance analyzer tool and will update.
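Until measured numbers are available, a rough upper-bound estimate can be worked out from the figures given in this thread (38400 bps, 128-byte FIFOs, the two trigger settings). The sketch below makes several assumptions that are not stated in the thread: 10 bits on the wire per byte (8N1 framing), each port receiving and transmitting continuously at line rate, one Rx interrupt per rx_trigger bytes, and the driver refilling the Tx FIFO to full on each Tx interrupt. The function names are made up for illustration.

```c
#include <stdio.h>

/*
 * Estimated interrupts per second for ONE port running flat out in both
 * directions. Assumes one Rx interrupt per rx_trigger bytes received, and
 * one Tx interrupt per (fifo_size - tx_trigger) bytes sent, i.e. the
 * driver tops the Tx FIFO up to full each time.
 */
double interrupts_per_second(double baud, double bits_per_frame,
                             int fifo_size, int rx_trigger, int tx_trigger)
{
    double bytes_per_second = baud / bits_per_frame;
    double rx_rate = bytes_per_second / rx_trigger;
    double tx_rate = bytes_per_second / (fifo_size - tx_trigger);
    return rx_rate + tx_rate;
}

/* Prints the estimate for a given trigger setting and port count. */
void print_estimate(const char *label, int rx_trigger, int tx_trigger, int ports)
{
    double per_port = interrupts_per_second(38400.0, 10.0, 128,
                                            rx_trigger, tx_trigger);
    printf("%s: about %.0f interrupts/s per port, %.0f for %d ports\n",
           label, per_port, per_port * ports, ports);
}
```

Under those assumptions, Rx trigger 16 with Tx trigger 112 works out to 480 interrupts per second per port (1440 for three ports), while Rx trigger 96 with Tx trigger 16 works out to roughly 74 per port (about 223 for three ports).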

kumarjan · September 12, 2014, 7:07am

Check the clock configuration. A register, maybe. The interrupt generator has
gone for a toss.

Regards
Kumon

seetha_rama_sarma_Jayanthy · September 12, 2014, 10:45am

I tried generating a crash dump manually with the keyboard. I crashed the system just a moment before it froze, so I did get a crash dump.

However, when I tried to analyze that crash dump I could not find anything beyond the obvious. The debugger said that it was a manually initiated crash and showed me a keyboard driver stack.

Could somebody let me know how to inspect the system state using this crash dump in WinDbg?

I mean, what commands should I be running, etc.?

Meanwhile I am trying to set up remote debugging for my system.

Scott_Noone_OSR · September 16, 2014, 10:19am

Crashing the machine before the hang isn’t likely to provide you much
interesting information, you really want to see the state of things when the
system is locked up. Your best bet is really to get hooked up to a live
debug session, as long as you’re not preventing the clock from ticking
you’ll be able to break in during the hang.

In any case…When investigating a hang, my usual sequence is:

1. “!running -ti” to look at what’s running on the processors
2. “!ready f” to look at the Ready Lists of the processors (i.e. what’s
   ready to run)
3. “!stacks 2” to get a summary view of all threads in the system

If none of that jumps out at you it gets more complicated. We have a custom
extension !apexts.uniqstack that I often use to show only unique call chains
in the system, which usually does a good job of picking out threads that are
“interesting”
(http://www.osronline.com/OsrDown.cfm/apexts.zip?name=apexts.zip&id=559).
Though for a hard hang/livelock it’s usually sufficient to just look at the
processors with !running.

-scott
OSR
@OSRDrivers


Michael_Jones · September 16, 2014, 1:53pm

On 9/16/2014 10:18 AM, Scott Noone wrote:

> the system is locked up. Your best bet is really to get hooked up to a
> live debug session, as long as you’re not preventing the clock from
> ticking you’ll be able to break in during the hang.

I’ve got a situation where the system hangs (not consistently in the
same place). I’ve got Windbg hooked up, but when the hang occurs, I
cannot break in.

What can I do in this case? I’ve got a request in to the client to see
if it’s possible to generate an NMI on this hardware. If that is
possible, is that my best way forward? Is there anything else I can try?

The target is Windows 7 64-bit, and WinDbg is connected over 1394.

TIA for any suggestions,

– mkj

//
// Michael K. Jones
// Stone Hill Consulting, LLC
// http://www.stonehill.com
//_______________________________________________

Scott_Noone_OSR · September 16, 2014, 5:20pm

If you’ve done something to stop the clock (e.g. a bus hang) you’re pretty
much out of luck. An NMI might help, but there’s no guarantee that the NMI
will get through either. Some alternate paths:

1. If you have a hardware device, a bus trace using an appropriate analyzer
   leading up to the hang can be a lifesaver.
2. DbgPrint! If you can easily reproduce the hang, you can just keep adding
   “here!” traces until you get closer to the point of hanging. VERY low tech,
   but it can work in a bind. If you don’t have access to a bus analyzer you
   could also instrument your register/port access functions to DbgPrint each
   device access (with data). Time consuming and painful to keep adding
   DbgPrints and rebooting, but this is the kind of situation where we earn our
   money.
3. In a virtual environment you may be able to pause the VM and generate a
   WinDbg DMP file from the snapshot (see vmss2core). Not very helpful if
   you’re not working with virtual hardware, but it’s an option.
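The idea of instrumenting the register access functions can be sketched like this. It is a user-mode model so that it compiles anywhere: fake_regs stands in for the device’s register window, and trace() formats into a buffer and writes to stderr where a driver would call DbgPrint. All of the names here (fake_regs, trace, traced_read, traced_write) are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

#define FAKE_REG_COUNT 16

/* Hypothetical stand-in for the device's memory-mapped registers. */
static uint32_t fake_regs[FAKE_REG_COUNT];

/* Most recent trace line; a driver would hand this text to DbgPrint instead. */
static char last_trace[64];

static void trace(const char *op, unsigned index, uint32_t value)
{
    snprintf(last_trace, sizeof last_trace, "%s reg[%u] = 0x%08x",
             op, index, (unsigned)value);
    fprintf(stderr, "%s\n", last_trace);
}

/* Caller must pass index < FAKE_REG_COUNT. */
uint32_t traced_read(unsigned index)
{
    uint32_t value = fake_regs[index];
    trace("RD", index, value);   /* data is only known after the read returns */
    return value;
}

/* Caller must pass index < FAKE_REG_COUNT. */
void traced_write(unsigned index, uint32_t value)
{
    /* Log first: if the write itself hangs the bus, the trace still shows it. */
    trace("WR", index, value);
    fake_regs[index] = value;
}
```

Routing every device access through two functions like these means the last line in the debugger output before the hang points at the access that was in flight, at the cost of a large timing change that can itself hide the problem.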

Good luck!

-scott
OSR
@OSRDrivers


Alex_Grig · September 16, 2014, 5:30pm

> If you’ve done something to stop the clock (e.g. a bus hang) you’re pretty
> much out of luck.

Put the board on the hotplug header and initiate disconnect/power off. Or just pull it (hope it won’t fry).