Need ideas to debug interrupt latency

I have a driver I’ve been helping a client debug. The driver is for a Can chip that has a pretty small buffer. It was frequently dropping packets, and I debugged the problem to the point where I could see the interrupts weren’t being serviced quickly enough. We put a logic analyzer on the system, and we can see that the time between the chip asserting the interrupt pin and the entry into our OnInterrupt routine is normally about 50 microseconds. But in some situations the time is 500 usec, and occasionally we see it in the 2-14 millisecond range.

This is an XP system (actually XP Embedded), and a previous build doesn’t exhibit this problem. I’m thinking there must be some other driver or system configuration that is causing the problem now. But none of the obvious stuff has changed. The drivers all appear to be the same between the version that works and the one that doesn’t.

Does anyone have any ideas on how to track this problem down? I’ve tried kernrate, but the problem is when this behavior is happening the number of interrupts changes due to the fact that communications are disrupted.

I don’t know if this is a clue or not, but we found that if we plug a USB flash drive in, and then un-plug it, the problem is very bad. As soon as the drive is plugged back in things are normal.

I hope this isn’t considered off topic. It’s really a problem of how to make the system work normally, but it’s keeping this particualr driver from working at all.

Thanks in advance!

Kevin Bentley wrote:

We put a logic analyzer on the system, and we can see
that the time between the chip asserting the interrupt pin
and the entry into our OnInterrupt routine is normally about
50 microseconds. But in some situations the time is 500 usec,
and occasionally we see it in the 2-14 millisecond range.

I suspect most of the answers you will get will center around the theme of “Windows is not a real-time OS, what you are seeing is expected behavior”.

My first quesion is what happens when the ISR is invoked? A poorly written
ISR can “loose” interrupts on even the slowest systems.

How long does the ISR take to release the interrupts and allow the next
entry? Does it pass processing to a DPC with minimal procesing in the ISR?
Does the ISR check to be sure all interrupts have been processed? Does it
use some type of mechanism to save volatile interrupt information for later
DPC processing? I’ve seen many complaints concerning “lost” interrupts that
were caused by simply attempting to hang to many requests on the DPC queue.
Subsequent requests end up in the bit bucket and since interrupt information
may be volatile and not saved, the interrupt appeared to have been lost.


The personal opinion of
Gary G. Little

wrote in message news:xxxxx@ntdev…
>I have a driver I’ve been helping a client debug. The driver is for a Can
>chip that has a pretty small buffer. It was frequently dropping packets,
>and I debugged the problem to the point where I could see the interrupts
>weren’t being serviced quickly enough. We put a logic analyzer on the
>system, and we can see that the time between the chip asserting the
>interrupt pin and the entry into our OnInterrupt routine is normally about
>50 microseconds. But in some situations the time is 500 usec, and
>occasionally we see it in the 2-14 millisecond range.
>
> This is an XP system (actually XP Embedded), and a previous build doesn’t
> exhibit this problem. I’m thinking there must be some other driver or
> system configuration that is causing the problem now. But none of the
> obvious stuff has changed. The drivers all appear to be the same between
> the version that works and the one that doesn’t.
>
> Does anyone have any ideas on how to track this problem down? I’ve tried
> kernrate, but the problem is when this behavior is happening the number of
> interrupts changes due to the fact that communications are disrupted.
>
> I don’t know if this is a clue or not, but we found that if we plug a USB
> flash drive in, and then un-plug it, the problem is very bad. As soon as
> the drive is plugged back in things are normal.
>
> I hope this isn’t considered off topic. It’s really a problem of how to
> make the system work normally, but it’s keeping this particualr driver
> from working at all.
>
> Thanks in advance!
>

On 11/14/07, Gary G. Little wrote:
>
> My first quesion is what happens when the ISR is invoked? A poorly written
> ISR can “loose” interrupts on even the slowest systems.
>
> How long does the ISR take to release the interrupts and allow the next
> entry? Does it pass processing to a DPC with minimal procesing in the ISR?
> Does the ISR check to be sure all interrupts have been processed? Does it
> use some type of mechanism to save volatile interrupt information for
> later
> DPC processing? I’ve seen many complaints concerning “lost” interrupts
> that
> were caused by simply attempting to hang to many requests on the DPC
> queue.
> Subsequent requests end up in the bit bucket and since interrupt
> information
> may be volatile and not saved, the interrupt appeared to have been lost.

The device generates an interrupt for every 8 bytes received, so the ISR
just reads the bytes off the chip, puts them into a FIFO, and clears the
interrupt. It doesn’t use a DPC because the amount of data is so small it’s
really easy to just buffer it up for the application to read (note: I didn’t
design this driver, and I’m not saying this is a good approach, but the ISR
is very simple).

There was a bug in this driver that’s exactly what you describe. The driver
was expecting the interrupt to happen 1-1 with frames to be read off the
chip. The chip has a small buffer though, so I changed it to keep reading
data off the chip if there is more, and not assume it’s just 1 frame for
each interrupt. It never takes more than a few microseconds for the ISR to
complete (verified with a logic analyzer).

The problem is apparent with the logic analyzer. It takes many milliseconds
for the OnInterrupt to be called once the interrupt is asserted. You can see
that once the chip is accessed (by looking at it’s chip select pin with the
logic analyzer), the interrupt is cleared very quickly. The problem is the
delay between the interrupt being asserted and OnInterrupt happening.

What is odd is a previous build of this device worked fine. We’re trying to
troubleshoot what changed here to cause this problem, but nothing has shown
up yet.

Thanks for your help.

Kevin Bentley

The first thing to realize is that you can’t control the delay in the general case, it’s just the way it is with Windows.

That being said, my guess is that either a) you’re line based and sharing an interrupt line with a chatty device or b) the you are or aren’t line based and the HAL has assigned a chatty device’s ISR a higher IRQL than yours. I’d start by investigating A first, you can determine it either through the info in Device Mangler or the !idt output from WinDBG.

Also, you might want to try running this to find other ISRs that might be taking a bit too long: http://www.microsoft.com/whdc/Driver/tips/DPC_ISR.mspx

-scott

Scott Noone
Software Engineer
OSR Open Systems Resources, Inc.
http://www.osronline.com

“Kevin Bentley” wrote in message news:xxxxx@ntdev…
On 11/14/07, Gary G. Little wrote:
My first quesion is what happens when the ISR is invoked? A poorly written
ISR can “loose” interrupts on even the slowest systems.

How long does the ISR take to release the interrupts and allow the next
entry? Does it pass processing to a DPC with minimal procesing in the ISR?
Does the ISR check to be sure all interrupts have been processed? Does it
use some type of mechanism to save volatile interrupt information for later
DPC processing? I’ve seen many complaints concerning “lost” interrupts that
were caused by simply attempting to hang to many requests on the DPC queue.
Subsequent requests end up in the bit bucket and since interrupt information
may be volatile and not saved, the interrupt appeared to have been lost.

The device generates an interrupt for every 8 bytes received, so the ISR just reads the bytes off the chip, puts them into a FIFO, and clears the interrupt. It doesn’t use a DPC because the amount of data is so small it’s really easy to just buffer it up for the application to read (note: I didn’t design this driver, and I’m not saying this is a good approach, but the ISR is very simple).

There was a bug in this driver that’s exactly what you describe. The driver was expecting the interrupt to happen 1-1 with frames to be read off the chip. The chip has a small buffer though, so I changed it to keep reading data off the chip if there is more, and not assume it’s just 1 frame for each interrupt. It never takes more than a few microseconds for the ISR to complete (verified with a logic analyzer).

The problem is apparent with the logic analyzer. It takes many milliseconds for the OnInterrupt to be called once the interrupt is asserted. You can see that once the chip is accessed (by looking at it’s chip select pin with the logic analyzer), the interrupt is cleared very quickly. The problem is the delay between the interrupt being asserted and OnInterrupt happening.

What is odd is a previous build of this device worked fine. We’re trying to troubleshoot what changed here to cause this problem, but nothing has shown up yet.

Thanks for your help.

Kevin Bentley

Is any other device sharing this interrupt?


From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Kevin Bentley
Sent: Wednesday, November 14, 2007 5:01 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] Need ideas to debug interrupt latency

On 11/14/07, Gary G. Little wrote:

My first quesion is what happens when the ISR is invoked? A poorly
written
ISR can “loose” interrupts on even the slowest systems.

How long does the ISR take to release the interrupts and allow the next
entry? Does it pass processing to a DPC with minimal procesing in the
ISR?
Does the ISR check to be sure all interrupts have been processed? Does
it
use some type of mechanism to save volatile interrupt information for
later
DPC processing? I’ve seen many complaints concerning “lost” interrupts
that
were caused by simply attempting to hang to many requests on the DPC
queue.
Subsequent requests end up in the bit bucket and since interrupt
information
may be volatile and not saved, the interrupt appeared to have been lost.

The device generates an interrupt for every 8 bytes received, so the ISR
just reads the bytes off the chip, puts them into a FIFO, and clears the
interrupt. It doesn’t use a DPC because the amount of data is so small
it’s really easy to just buffer it up for the application to read (note:
I didn’t design this driver, and I’m not saying this is a good approach,
but the ISR is very simple).

There was a bug in this driver that’s exactly what you describe. The
driver was expecting the interrupt to happen 1-1 with frames to be read
off the chip. The chip has a small buffer though, so I changed it to
keep reading data off the chip if there is more, and not assume it’s
just 1 frame for each interrupt. It never takes more than a few
microseconds for the ISR to complete (verified with a logic analyzer).

The problem is apparent with the logic analyzer. It takes many
milliseconds for the OnInterrupt to be called once the interrupt is
asserted. You can see that once the chip is accessed (by looking at it’s
chip select pin with the logic analyzer), the interrupt is cleared very
quickly. The problem is the delay between the interrupt being asserted
and OnInterrupt happening.

What is odd is a previous build of this device worked fine. We’re trying
to troubleshoot what changed here to cause this problem, but nothing has
shown up yet.

Thanks for your help.

Kevin Bentley

— NTDEV is sponsored by OSR For our schedule of WDF, WDM, debugging
and other seminars visit: http://www.osr.com/seminars To unsubscribe,
visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

“Kevin Bentley” wrote in message
news:xxxxx@ntdev…
> On 11/14/07, Gary G. Little wrote:
>>
>> My first quesion is what happens when the ISR is invoked? A poorly
>> written
>> ISR can “loose” interrupts on even the slowest systems.
>>
>> How long does the ISR take to release the interrupts and allow the next
>> entry? Does it pass processing to a DPC with minimal procesing in the
>> ISR?
>> Does the ISR check to be sure all interrupts have been processed? Does it
>> use some type of mechanism to save volatile interrupt information for
>> later
>> DPC processing? I’ve seen many complaints concerning “lost” interrupts
>> that
>> were caused by simply attempting to hang to many requests on the DPC
>> queue.
>> Subsequent requests end up in the bit bucket and since interrupt
>> information
>> may be volatile and not saved, the interrupt appeared to have been lost.
>
>
> The device generates an interrupt for every 8 bytes received, so the ISR
> just reads the bytes off the chip, puts them into a FIFO, and clears the
> interrupt. It doesn’t use a DPC because the amount of data is so small
> it’s
> really easy to just buffer it up for the application to read (note: I
> didn’t
> design this driver, and I’m not saying this is a good approach, but the
> ISR
> is very simple).

In your first post, you called it a Can chip. Not sure what that is, but
sounds like a serial comm device from your description here. Since your
device appears to be more sensitive to latency than most, this design
appears to be a reasonable attempt to compensate for it. Especially given
the small amount of time you observed it spending inside the ISR.

> There was a bug in this driver that’s exactly what you describe. The
> driver
> was expecting the interrupt to happen 1-1 with frames to be read off the
> chip. The chip has a small buffer though, so I changed it to keep reading
> data off the chip if there is more, and not assume it’s just 1 frame for
> each interrupt. It never takes more than a few microseconds for the ISR to
> complete (verified with a logic analyzer).

Can you change the chip config to make it interrupt whenever there is one or
more bytes in the buffer, instead of waiting for an entire frame? Your
ISR/DPC would have to drain the buffer on the chip. That might require that
you exit the ISR without reading one whole frame, depnding on how fast the
subsequent bytes in the frame arrive.

Hope this helps,

Phil

Philip D. Barila
Seagate Technology LLC
(720) 684-1842
As if I need to say it: Not speaking for Seagate.