Weird non-serviced interrupt problem...

I’m looking for some advice and direction on a problem that I’m stuck
on.

Summary: We have systems with a PCI card that, seemingly depending on
which PCI devices it is sharing with, will either work fine, or the ISR
will never be called and the system will hang. I’m all confused.

Long version:

We have a Win2003 SP 2 system that we’ve been shipping for 10 years that
is a PBX. We recently moved to a new motherboard and BIOS (Kontron
ETX-PM). Most of the time, the system works fine. Interrupts work, the
system functions as expected. Furthermore, it shares interrupts with
other motherboard devices (video, Ethernet, USB) without a problem. The
card hardware is not new; it has been basically unchanged for 5 years.
The card driver is not new either; it has been unchanged for 3 years.

However, we’ve noticed that sometimes the PCI INT A interrupt line from
our card ends up sharing with a “bad sharer” (such as ACPI or one of the
USB devices (Intel 82801DB/DBM device 24C4)), and when that happens, the
very first interrupt that our card generates is never serviced. Using
WinDbg, we’ve set a breakpoint in the driver’s ISR, and it is never
called. However, !idt looks good and the KINTERRUPT objects look good.
The “bad sharer” information could, of course, just be coincidental with
some other issue.

We will let it sit for hours, and it just sits there servicing
higher-priority interrupts but never handles ours or calls our ISR.

We can then pull out our PCI card (yes, naughty, but bear with me),
which stops driving the interrupt, and the system is then fine and
responsive. That just shows that it’s our card generating the interrupt
and that just stopping it from driving is enough to “fix” the hang.

To me, it sounds like what is described in KB 824395 (“Interrupts that
come from a PCI device that uses a Windows NT 4.0-style driver are
ignored”), but someone at Microsoft checked, and that doesn’t seem to be
a problem in Win2003, despite what the KB article says.

I should stress again that MOST OF THE TIME THE SYSTEM WORKS FINE. The
only time it seems to have a problem is when it’s sharing with the ACPI
device or a particular USB device. Thus, it does not seem to be a
hardware or driver issue, but one with how the IRQ resources are
allocated or handled.

We’ve noticed that the system is always fine if we have a USB keyboard /
mouse in the system. We’ve discovered that if we re-flash the BIOS (thus
clearing any BIOS resource allocation) and don’t plug in any USB
devices, we will have the problem. We can also force the problem by
assigning PCI INT A to IRQ 9 in the BIOS, which is the IRQ used by ACPI.

So, the goal – other than getting it to work correctly – is to figure
out why sharing with those particular devices causes a problem. Sharing
with other devices does not cause a problem, so it does not seem to be a
problem with the hardware or the driver.

Is there a way to get Windows to respect the BIOS settings, or some
other way to control Windows’ IRQ resource allocation?

Thanks in advance for any advice!

Try /PCILOCK in boot.ini; this prevents the HAL from changing IRQ and
I/O resources.


Don Burn (MVP, Windows DDK)
Windows 2k/XP/2k3 Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr
Remove StopSpam to reply


It sure sounds like some other driver is not sharing interrupts
correctly. There is not much you can do about that other than fully
debug it to prove that while your ISR is not getting called, the other
ISR is and is returning TRUE indicating that it has handled the
interrupt, thus blocking your ISR from running. Of course if you could
get yourself installed and connected BEFORE other drivers, life would be
grand.


From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Taed Wynnell
Sent: Thursday, May 31, 2007 1:34 PM
To: Windows System Software Devs Interest List
Subject: [ntdev] Weird non-serviced interrupt problem…


The algorithm for level-triggered interrupts looks like this:

  1. Ask the first driver in the chain: is your device interrupting?
  2. If it says yes, then EOI and lower IRQL. If no, ask the next driver in
    the chain. Repeat.
  3. When IRQL is lowered, if the line is still asserted, go to step 1.

If the ACPI or USB drivers are reporting that their devices are interrupting
when, in fact, they are not, I suspect that is because the hardware on your
motherboard isn’t functioning correctly, fooling them into believing that
their devices interrupted. Both of these drivers run fine on lots of
motherboards. They generally only report that their devices are
interrupting when they are in fact interrupting.

  • Jake Oshins
    Windows Kernel Team


“Don Burn” wrote in message news:xxxxx@ntdev…
> Try /PCILOCK on the boot.ini, this prevents the HAL from changing IRQ and
> I/O resources.

I had forgotten about that. However, I tried it on this Windows Server 2003
SP 2 system, and it seems to ignore my BIOS settings / suggestions just as
much as it did without the /PCILOCK.

My Phoenix BIOS even had a “Secured Setup Configuration” Yes/No setting on
the PCI resource page, but playing with that seemed to have no effect,
either.

I’d like to be able to control the PCI IRQ assignment in some way to make
the testing clearer, but, of course, that’s not the real issue.

Thanks, anyway!

“Roddy, Mark” wrote in message news:xxxxx@ntdev…
> It sure sounds like some other driver is not sharing interrupts
> correctly. There is not much you can do about that other than fully
> debug it to prove that while your ISR is not getting called, the other
> ISR is and is returning TRUE indicating that it has handled the
> interrupt, thus blocking your ISR from running. Of course if you could
> get yourself installed and connected BEFORE other drivers, life would be
> grand.

Yeah, that’s an excellent suggestion. However, I went down that path and it
seems to return FALSE. The other driver in the case that I can reproduce
easily is USBPORT.sys, which unfortunately handles 4 different devices on
different IRQs, and there are also too many interrupts going on to debug
reasonably. But I think I caught it on the right one, and it returns FALSE.

Is there some way to set a breakpoint in WinDbg that hits only on a certain
KINTERRUPT object or a certain IRQ?

I’m a SoftICE guy (but can’t use it here since plugging in the keyboard
makes the problem go away), so I don’t know how to do the advanced things.
In SoftICE, I’d do something like:
bpx USBPORT!USBPORT_InterruptService if *ebp-8 ==

I recently had to debug an interrupt problem and I learnt a few things.

It might be very difficult without extra hardware. For starters, the PC hardware doesn’t offer many diagnostic capabilities. The IOAPIC has no counters, which would be very useful. And AFAIK there is no way to check the current state of a specific interrupt line.

The kernel has a very nice interrupt tracing facility. But it works at the driver’s ISR and DPC level, not at the physical IRQ level. The tracing seems to be designed for measuring ISR and DPC performance, not for diagnosing hardware interrupt problems.

Breakpoints and debug tracing might be misleading because they don’t work in real time. By the time the debugger returns control to the target, interrupt lines could have changed state. Even conditional breakpoints could be too slow.

AFAIK there is no way to change IRQ assignments on ACPI HALs. You can on non-ACPI ones, from the Device Manager. As a temporary workaround, you can swap PCI slots.

In theory you should perform the following steps:

  1. Make sure the card is signaling the interrupt (again, this might be difficult without extra hardware).
  2. Make sure the IRQ level and interrupt line match along the whole path. !pci, !arbiter, !iopaic and !idt are your friends to verify this.

You might try installing the checked version of the sharer drivers. But I don’t know if they issue debug messages on interrupts.

You can set a breakpoint on the physical ISR. It is located towards the end of the first KINTERRUPT object for that specific IRQ. In other words, the KINTERRUPT object actually contains the ISR code.

Yes, you can set a conditional breakpoint in USBPORT ISR for a specific KINTERRUPT object. But again, conditional breakpoint evaluation could be too slow.

You may not be able to directly tell which interrupt line coming into
the IOAPIC is asserted but you can go through the RTE registers and look
for target vectors that are the same as your device, then selectively
start masking those interrupts to see if there is a change in behavior.

I have only been halfway following this thread, but it sounds like a
handler earlier in the chain is claiming the interrupt before the device
gets it. This implies that the earlier device’s interrupt is stuck in an
(apparently) asserted condition; masking it (in the IOAPIC RTE) and/or
futzing with the delivery semantics in the RTE might help you root out the
problem. It’s possible that the BIOS is incorrectly programming and/or
describing the delivery configuration to the OS.

You can break early during boot and examine the PCI config space for
your device to see what the initial BIOS settings are, and you can also
install the checked ACPI.SYS and use the ACPI debugger to examine the
routing statements in the ASL. My experience has been that the OS
honors whatever settings the BIOS describes in ASL; setting aside
oddities like PCI hot-plug, I would expect this to hold true on your
system.

-Zach

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:bounce-288719-
xxxxx@lists.osr.com] On Behalf Of xxxxx@rahul.net
Sent: Friday, June 01, 2007 9:23 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Weird non-serviced interrupt problem…


wrote in message news:xxxxx@ntdev…
> Make sure the card is signaling the interrupt (again, might be difficult
without extra hardware).
> Make sure IRQ level and interrupt line matches on all the path.
> !pci, !arbiter, !iopaic and !idt are your friends to verify this.

Yay – I’m making some progress…

The !idt, my driver, and the InterruptLine field of the PCI config space
(!pci) for my device all agree that it should be on IRQ 11.

However, by breaking into WinDbg when my system is hung, and then doing !pic
a few times, I see that IRQ 11 is not actually physically requested /
asserted. However, a 'scope does show that I’m driving the interrupt that I
expect to be (PCI interrupt A on my card).

But even more shocking is that I did !pic a bunch of times and the same IRQs
were always physically asserted, and then I physically pull my card, and lo
and behold, IRQ 10 stops being asserted! I did that test twice to make sure
that I actually saw that wackiness.

So, something is whacked out somewhere causing my PCI interrupt A that
should be routed to IRQ 11 to actually be routed to IRQ 10.

Any further suggestions on how to debug how the PCI interrupt router is
programmed and what’s messing it up?

I just learned about a BIOS table called the PCI Interrupt Routing Table,
which seemingly should be stored around 0x000F0000 in memory.
However, I was unable to read that address in WinDbg (invalid address) –
does anyone know how to do that?

It seems likely that this is a problem with the motherboard or BIOS at this
point…

Also, I couldn’t figure out the !arbiter command and what it could do for
me, and the online docs were of no use – could you explain that? (Also,
!iopaic doesn’t exist, but I guess you mean !iopic, which didn’t exist
either, but !pic did.)

Thanks!

“Taed Wynnell” wrote in message news:xxxxx@ntdev…
> So, something is whacked out somewhere causing my PCI interrupt A that
> should be routed to IRQ 11 to actually be routed to IRQ 10.

I verified that when it’s in the “bad state” (which is what we were
debugging on the phone), the South Bridge’s interrupt routing control
register 0x60 has 0x0A (IRQ 10), but my device’s and the other devices that
are on physical PCI INTA# have InterruptLine 0x3C registers that contain
0x0B (IRQ 11).

HOWEVER, I did an initial breakpoint with WinDbg, and the values were all
0x0B at that point, but then about 30 seconds later, the South Bridge
interrupt routing control register was changed to 0x0A. So, this seems to
either be a Windows problem, or maybe an error in the BIOS’ PCI Interrupt
Routing Table that causes it to make a bad decision.

After 2 weeks, I’m glad that I’m making some progress…

Taed Wynnell wrote:

> But even more shocking is that I did !pic …

Oh, you are using a PIC! I should have realized before. After rereading your original post, it seems you are using PIC, hot-plug and a legacy (non-PnP) driver. Hmm, that doesn’t sound like the most stable combination.

PIC is the older legacy interrupt controller. It was replaced by the IOAPIC quite some time ago. It is possible that the IOAPIC is disabled in the BIOS for some reason. The IOAPIC supports 24 interrupts, so you get less interrupt sharing, if any. You need a different HAL if you switch to the IOAPIC. I don’t know if you can replace the HAL without a full reinstall. And of course I can’t say if this will solve the problem.

> However, by breaking into WinDbg when my system is hung, and then doing !pic
> a few times, I see that IRQ 11 is not actually physically requested / asserted.

I didn’t have much luck with the !pic command. I don’t know whether it is a debugger problem, but it didn’t seem to provide meaningful data for me. Anyway, the physically requested (IRR) bit is not exactly the interrupt pin signal state. That bit is set on the active transition, but is cleared at the interrupt acknowledge cycle. What should remain set from then until EOI is the in-service bit (the ISR bit is set at the same time that the IRR bit is cleared).

> (Also, !iopaic doesn’t exist,

Sorry, it was a typo. I meant !ioapic. But it won’t work in your case.

> I verified that when it’s in the “bad state” (which is what we were
> debugging on the phone), the South Bridge’s interrupt routing control
> register 0x60 has 0x0A (IRQ 10), but my device’s and the other devices that
> are on physical PCI INTA# have InterruptLine 0x3C registers that contain
> 0x0B (IRQ 11).
>
> HOWEVER, I did an initial breakpoint with WinDbg, and the values were all
> 0x0B at that point, but then about 30 seconds later, the South Bridge
> interrupt routing control register was changed to 0x0A.

I have no idea why the 0x60 register is changing; it doesn’t look good at all. But note that 0x60 is not necessarily the relevant register for your interrupt. You might be missing one level of indirection. Those southbridge registers route PIRQ# lines at the bridge to the PIC. But PIRQA is not the same as INTA on the PCI card; otherwise all cards would end up using the same interrupt.

There is a further (actually, previous) routing stage that maps INTx for each PCI slot to PIRQ# signals. Use !acpiirqarb (ACPI IRQ arbiter) or !pciir (non-ACPI) to see the links.

I have to start by thanking you; you and the others have been a lot of help!

wrote in message news:xxxxx@ntdev…
> Oh, you are using a PIC! I should have realized before. After rereading
your original post it seems you are using PIC, hot-plug and a legacy (non
Pnp) driver. Hmm, doesn’t sound as the most stable combination.

It’s just ACPI with a PIC with a non-WDM driver. It’s not hot-plug. I just
pulled the PCI card to prove the interrupt was being driven by my card –
it’s safe to do as long as you don’t short something. Of course, you can’t
put it back in with the power on!

> PIC is the older legacy interrupt controller. Since quite some time it was
replaced by the IOAPIC.

Yeah, I was surprised that this new motherboard didn’t have an IOAPIC as
well. It uses the Intel 82801DB.

> I didn’t have much luck with the !pic command. Don’t know if it is a
debugger problem or not, but it didn’t seem to provide meaningful data for
me. Anyway, the physically requested bit is not exactly the interrupt pin
signal state. That bit is set on the active transition, but is cleared at
the interrupt ack cycle. What should remain set since then until EOI is the
in-service bit (the ISR bit is set at the same time that the IRR is
cleared).

It seems to work well for what I’m looking at. I see that it wouldn’t do
the right thing if it were edge-triggered, but it seems to work fine for
level-sensitive.

> I have no idea why the 0x60 register is changing, it doesn’t look good at
all. But note that 0x60 is not necessarily the relevant register for your
interrupt. You might be missing one level of indirection. Those southbridge
registers route PIRQ# lines at the bridge to the PIC. But PIRQA is not INTA
in the PCI card, otherwise all cards would end using the same interrupt.

I assume it’s changing because Windows is re-mapping the IRQs, and 20
seconds or so after the boot sounds like the right time for that.

The two cases of PCI INT A are the same in my case for that particular slot;
I verified the routing with the motherboard manufacturer, and it seems to
make sense given what I see. Also, the other devices that are on physical
PIRQA (again, according to the motherboard manufacturer) on the South Bridge
always are routed to the same IRQ as my device by using !idt or msinfo32, so
that double-confirms the routing.

> There is a further (actually, previous) routing that maps INTx for each
PCI slot to PIRQ# signals. Use !acpiirqarb (acpi irq arbiter), or !pciir
(non-acpi) to see the links.

Those sound perfect, but they didn’t work for me. The first was at least a
valid command but gave the error:
Failed to get symbol acpi!ARBITER_EXTENSION
but I verified that I have the right Symbol Server symbols for ACPI, and also
I can do a “dt acpi!ARBITER_EXTENSION f0000000” and it shows the structure
just fine. A Google search on that error didn’t come up with anything.

Again, thanks for all your help!

Taed Wynnell wrote:

> Yeah, I was surprised that this new motherboard didn’t have an IOAPIC as
> well. It uses the Intel 82801DB.

The 82801 definitely has an IOAPIC. It is probably disabled by default in the BIOS because those boards usually have very strong legacy support.

> I assume it’s changing because Windows is re-mapping the IRQs, and 20
> seconds or so after the boot sounds like the right time for that.

Sorry, I had understood that it was constantly changing. One idea would be to check whether the time it changes has some relation to your driver being loaded.

> Use !acpiirqarb (acpi irq arbiter), or !pciir (non-acpi) to see the links.

> Those sound perfect, but they didn’t work for me. The first was at least a
> valid command but gave the error:
> Failed to get symbol acpi!ARBITER_EXTENSION

I don’t know why it doesn’t work for you. But as long as you are sure that you are routed to the first southbridge slot (PIRQA, register 0x60), then you probably don’t need it.

Probably a driver rewrite from NT4-style to WDM can help.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com


The key here is probably that you have a non-WDM driver. I missed that in
the early posts. It’s still possible that the BIOS is busted, giving you
the wrong routing table.

Make sure your driver is actually passing its device object as the fourth
parameter to HalAssignSlotResources. On some motherboards, it just isn’t
possible to get the routing right unless we can tie an NT4-style driver back
to the actual physical device that is causing the interrupt. That can only
happen when you tell the OS which driver is making the call for this PCI
device, and the only way to do that is to pass your DeviceObject. Many NT4
drivers just passed NULL here.

  • Jake Oshins
    Windows Kernel Team

“Taed Wynnell” wrote in message news:xxxxx@ntdev…
>I have to start by thanking you; you and the others have been a lot of
>help!
>
> wrote in message news:xxxxx@ntdev…
>> Oh, you are using a PIC! I should have realized before. After rereading
>> your original post, it seems you are using a PIC, hot-plug and a legacy
>> (non-PnP) driver. Hmm, that doesn’t sound like the most stable combination.
>
> It’s just ACPI with a PIC with a non-WDM driver. It’s not hot-plug. I just
> pulled the PCI card to prove the interrupt was being driven by my card –
> it’s safe to do as long as you don’t short something. Of course, you can’t
> put it back in with the power on!
>
>> PIC is the older legacy interrupt controller. It was replaced by the
>> IOAPIC quite some time ago.
>
> Yeah, I was surprised that this new motherboard didn’t have an IOAPIC as
> well. It uses the Intel 82801DB.
>
>> I didn’t have much luck with the !pic command. I don’t know if it is a
>> debugger problem or not, but it didn’t seem to provide meaningful data for
>> me. Anyway, the physically requested bit is not exactly the interrupt pin
>> signal state. That bit is set on the active transition, but is cleared at
>> the interrupt acknowledge cycle. What should remain set from then until
>> EOI is the in-service bit (the ISR bit is set at the same time that the
>> IRR bit is cleared).
>
> It seems to work well for what I’m looking at. I see that it wouldn’t do
> the right thing if it were edge-triggered, but it seems to work fine for
> level-sensitive.
>
>> I have no idea why the 0x60 register is changing; it doesn’t look good at
>> all. But note that 0x60 is not necessarily the relevant register for your
>> interrupt. You might be missing one level of indirection. Those
>> southbridge registers route PIRQ# lines at the bridge to the PIC. But
>> PIRQA is not INTA in the PCI card, otherwise all cards would end up using
>> the same interrupt.
>
> I assume it’s changing because Windows is re-mapping the IRQs, and 20
> seconds or so after the boot sounds like the right time for that.
>
> For that particular slot, the card’s INT A and PIRQA are the same line; I
> verified the routing with the motherboard manufacturer, and it seems to
> make sense given what I see. Also, the other devices that are on physical
> PIRQA (again, according to the motherboard manufacturer) on the South
> Bridge are always routed to the same IRQ as my device according to !idt or
> msinfo32, so that double-confirms the routing.
>
>> There is a further (actually, previous) routing that maps INTx for each
>> PCI slot to PIRQ# signals. Use !acpiirqarb (ACPI IRQ arbiter), or !pciir
>> (non-ACPI) to see the links.
>
> Those sound perfect, but they didn’t work for me. The first was at least a
> valid command but gave the error:
> Failed to get symbol acpi!ARBITER_EXTENSION
> but I verified that I have the right Symbol Server symbols for ACPI, and I
> can also do a “dt acpi!ARBITER_EXTENSION f0000000” and it shows the
> structure just fine. A Google search on that error didn’t come up with
> anything.
>
> Again, thanks for all your help!
>
>
>

“Jake Oshins” wrote in message
news:xxxxx@ntdev…
> The key here is probably that you have a non-WDM driver. I missed that in
> the early posts. It’s still possible that the BIOS is busted, giving you
> the wrong routing table.

Good point, so I just eliminated our card and driver from the picture. I
renamed our drivers, powered down, pulled out our card, and still Windows
re-assigns the PCI interrupt to the (bad) value and doesn’t update the
devices’ InterruptLine register.

The stack where it’s changing it to the bad value is:
FrameEBP RetEIP Syms Symbol
F78F2470 80A4C2A2 Y hal!_WRITE_PORT_ULONG+0009
F78F2484 80A4C0E4 Y hal!_HalpPCIWriteUcharType1+0026
F78F24C8 80A4C513 Y hal!_HalpPCIConfig+005C
F78F24E8 80A4D07D Y hal!_HalpWritePCIConfig+002D
F78F2578 F733D89A Y hal!_HaliPciInterfaceWriteConfig+0033
F78F259C F733D92E Y pci!_PciReadWriteConfigSpace+0038
F78F25BC F733DD86 Y pci!_PciWriteDeviceConfig+001E
F78F26E4 F73413BF Y pci!_PciExternalWriteDeviceConfig+01AC
F78F2708 F733D04D Y pci!_PciWriteDeviceSpace+005D
F78F2728 F736040F Y pci!_PciPnpWriteConfig+001D
F78F2798 F73605F3 Y acpi!_PciConfigSpaceHandlerWorker+014D
F78F27B4 F736CB4E Y acpi!_PciConfigSpaceHandler+006F
F78F27E0 F7365404 Y acpi!_InternalOpRegionHandler+0046
F78F2814 F7367D77 Y acpi!_WriteCookAccess+0145
F78F283C F7369622 Y acpi!_RunContext+0065
F78F2854 F73696ED Y acpi!_InsertReadyQueue+00A4
F78F2870 F7368A36 Y acpi!_RestartContext+0027
F78F2890 F736467E Y acpi!_AsyncEvalObject+0150
F78F28C4 F735EAE0 Y acpi!_AMLIAsyncEvalObject+006F
F78F2900 F735EDAF Y acpi!_AcpiArbSetLinkNodeIrqWorker+024E
F78F291C F7376CFD Y acpi!_AcpiArbSetLinkNodeIrqAsync+0055
F78F294C F7376D8B Y acpi!_AcpiArbSetLinkNodeIrq+0033
F78F2978 F7376EA6 Y acpi!_MakeTempLinkNodeCountsPermanent+006B
F78F29B8 F737C7F7 Y acpi!_AcpiArbCommitAllocation+00C8
F78F29D0 808F9BB2 Y acpi!_ArbArbiterHandler+0043
F78F29EC 808FC5D7 Y ntoskrnl!_IopCommitConfiguration+0024
F78F2A40 80909E68 Y ntoskrnl!_IopAllocateResources+0189
F78F2A98 80909FD3 Y ntoskrnl!_IopAssignResourcesToDevices+0100
F78F2AD4 80903FE4 Y ntoskrnl!_IopProcessAssignResources+00D9
F78F2D30 809045EC Y ntoskrnl!_PipProcessDevNodeTree+00A6
F78F2D58 808224E0 Y ntoskrnl!_PiProcessStartSystemDevices+003A
F78F2D80 8087A469 Y ntoskrnl!_PipDeviceActionWorker+0186
F78F2DAC 8094095C Y ntoskrnl!_ExpWorkerThread+00EB
F78F2DDC 8088757A Y ntoskrnl!_PspSystemThreadStartup+002E
00000000 00000000 Y ntoskrnl!_KiThreadStartup+0016

I’ve had a Premier ticket open with Microsoft, but they haven’t been of any
help so far…

Thanks!

That stack trace is the ACPI driver interpreting BIOS code. If your BIOS
is writing the wrong value (with help from the ACPI driver) in the interrupt
pin register, you need a new BIOS.

  • Jake
