Loosing Interrupts on a PCI Express card running under XP

Hi All,

We?ve recently run into a problem, and I?m hoping that someone might be able to point us in the right direction.

Problem:

It looks like we are loosing an interrupt from one of our in-house designed PCI Express cards, which causes our system to basically pause until it magically recovers some time in the future (5 seconds to 45 min). The PC does not halt, keyboard, mouse, etc all work fine. It appears that the card ?thinks? it?s asserting an interrupt, but our driver?s ISR is not being called.

Details:

The PCI Express card performs multiple functions and can trigger an interrupt based on 3 different (async) events, a DMA transfer complete, a communications message being received (CANbus if anyone cares), or a hardware error (probably not relevant in this case). The card actually has 3 internal interrupt lines for each of these interrupts. These are AND?ed together with their respective enable lines, and the 3 results are then OR?ed as they pass into a COTS PCI Express IP core, which is then supposed to send the interrupt message over the PCI Express Bus.

We are running this on XP SP2, so no MSI, just straight PCI from Windows and the driver?s perspective.

The driver?s ISR reads a status register (and a control register) only once at the beginning of the ISR. It checks for the 3 different interrupt causes and disables each individual interrupt line if it finds one of them active. It checks for all three always (3 ?if? blocks). If any of these are active, we queue our DPC and return TRUE from the ISR. The Interrupt saves the status (and control) registers in a shadow copy, one for each of the interrupt sources. Later on in the DPC, we check for each source (using the shadow registers saved from the top of the ISR), handle it, and re-enable the individual interrupt.

Thoughts:

After thinking about this a while, I am wondering what happens if we get a DMA interrupt, enter the ISR, and then get a communications interrupt after we?ve already read the Status register in the ISR. We would only disable the DMA interrupt from interrupting, but return TRUE from the ISR. I had thought that with PCI using level triggered interrupts, our ISR would just get called again because the interrupt is still active, but now I?m not so sure.

It does appear that we only loose the card?s interrupt when we are performing DMA transfers and communications are active, but not always.

And I have no idea at all why, if we loose an interrupt, it ever gets unstuck. Once logging in via remote desktop unstuck a system, but not a second time. Also accessing another communications device did it once too, but not every time.

Any thoughts, advice or suggestions are greatly appreciated. If anyone needs any more information, I?ll be happy to provide it. Thank you!

Best regards,
-Mike

FIRST, thanks for taking the time to put together a well structure, informative, and clear post to the list. It’s a pleasure to try to answer such questions.

BINGO! You’re right about PCI, but you’re on the PCI Express Bus.

Even in legacy INTx mode PCI Express interrupts are generated via messages on the bus… so I *believe* (I’m not 100% sure, somebody will certainly correct me if I’m wrong… I’m not an expert on PCIe) you need to treat these also as if they had edge-triggered semantics.

Can you check the Interrupt Status Register bits from your DpcForIsr, for example, and then service the various events that might be outstanding there?

Your symptoms TOTALLY sound like you’re dropping an edge-triggered interrupt…

Peter
OSR

(Following-up my own post)

I just did some quick research, to see if my initial reply about PCIe was correct. It might not be.

According to the PCI-Sig, PCIe legacy INTx support is supposed to have level-triggered semantics.

That would rule-out the whole edge-triggered issue in terms of INTx support on PCIe.

Hmmm… It STILL feels a lot like a dropped edge-triggered interrupt to me – Guess I’ll have to leave more of the diagnosis to one of the more hardware knowledgeable folks on the forum.

Sorry to not be of more help,

Peter
OSR

Certainly, it is hard to say anything without actually seeing the your code, but the very first thought that gets into my head is that the root of your problem is described by the following statement:

[begin quote]

It checks for the 3 different interrupt causes and disables each individual interrupt line if it finds one of them active. It checks for all three always (3 ?if? blocks). If any of these are active, we queue our DPC and return TRUE from the ISR.The Interrupt saves the status (and control) registers in a shadow copy, one for each of the interrupt sources. Later on in the DPC, we check for each source (using the shadow registers saved from the top of the ISR), handle it, and re-enable the individual interrupt.

[end quote]

If interrupt line remains asserted after ISR returns TRUE ( as it should happen with level-triggered interrupt in the situation you describe), interrupt will fire again. It will happen immediately after interrupt gets unmasked in both PPR and TRP (I assume APIC -based machine here), i.e. well before
your DPC routine gets executed. Therefore, in this case ISR will run twice. On each invocation it will disable its corresponding line and queue a DPC. However, the same DPC cannot be queued more than once, can it…

What happens if DPC routine assumes one-to-one correspondence between interrupt occurrence and its invocation, i.e. that it does not have to re-enable more than one line per invocation??? The answer is obvious - the line that got disabled upon the second interrupt occurrence remains in disabled state.

Judging from your description of a problem, this is exactly what happens here…

Anton Bassov

Good one, Anton! That’s certainly a very common error. I was so focused on the PCIe aspect of the problem that this more obvious issue didn’t occur to me. Nice pickup.

Peter
OSR

>that with PCI using level triggered interrupts, our ISR would just get called again because the interrupt

is still active

Should be so.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

>and return TRUE from the ISR. The Interrupt saves the status (and control) registers in a shadow copy,

Anton is correct. The good way of solving the issue pointed by Anton:

a) ISR

IntStatus = READ_REGISTER_ULONG(interrupt status);
if( IntStatus == 0 )
return FALSE; // we are not interrupting
// note that, if several ISR invocations interrupt the same DPC (which can occur), the first one will
// put the bits to zero ->IntStatusShadow and the second one must preserve the bits from the first
// one, thus OR instead of copy
DevExt->IntStatusShadow |= IntStatus;
… tell the hardware that all interrupts in IntStatus are serviced
KeInsertQueueDpc // no-op if the DPC is already in the queue, queues another DPC possible to
// another CPU if the DPC is running, but this is safe
return TRUE;

b) DPC
{
ULONG IntStatusLocal;
for(;:wink:
{
KeAcquireInterruptSpinLock
IntStatusLocal = DevExt->IntStatusShadow;
DevExt->IntStatusShadow = 0;
KeReleaseInterruptSpinLock
if( IntStatusLocal == 0 )
break;
handle each bit it IntStatusLocal
}
}

For pre-XP, move the statements within the interrupt spinlock to a KeSynchronizeExecution callback.

Probably the missing OR was your issue, since in this case, if 2 interrupts hit during the DPC execution, then the first one will be lost and its cause will never be handled by the DPC.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> That’s certainly a very common error. I was so focused on the PCIe aspect of the problem

that this more obvious issue didn’t occur to me.

What makes me surprised here is that the whole thing eventually recovers - if this is, indeed, a root of a problem, then, depending on the order of IF blocks in ISR and DPC, either line should disabled forever, or
two lines should become mutually exclusive, i.e. when one of them is enabled another one is disabled.
Maybe I am just too dumb, but, to be honest, I just cannot see how both lines may become enabled since the very first occurrence of the above scenario…

Anton Bassov

Thank you everyone for taking the time to look at this.

First, it?s good news to hear that I will be interrupted again if the interrupt doesn?t go away. This is what I expected and designed for.

The DPC is supposed to handle it if the interrupt was called multiple times before the DPC got to run. I actually have 3 sets of shadow registers, one for each interrupt reason. And the DPC should also handle the cases when it?s run on multiple processors at the same time. Probably the easiest thing to do now is to attach a slightly paired down version of the DCP and ISR code for everyone to look at, because I can?t find anything wrong with it so far.

Thanks!

enum INTERRUPT_REASON
{
ISR_FOR_DMA,
ISR_FOR_CANBUS,
ISR_FOR_ERRORS,
NUM_INTERRUPT_REASONS

};

typedef struct _ISR_DATA{

//
// Is this interrupt data valid? Did we get this interrupt?
//
BOOLEAN Active;

//
// Shadow copies of the Status and Control registers from this ISR entry
//
ULONG ShadowIsrStatusRegister;
ULONG ShadowIsrControlRegister;

}ISR_DATA;

typedef struct _GEMINI_DEVICE_EXT {

//
// …
//

//
// Why were we interrupted?
//
ISR_DATA InterruptReason[NUM_INTERRUPT_REASONS];

//
// …
//

} GEMINI_DEVICE_EXT, *PGEMINI_DEVICE_EXT;

BOOLEAN HandleInterrupt ( PKINTERRUPT Interrupt,
PVOID ServiceContext)
{
//--------------------------------------------------------------------
PGEMINI_DEVICE_EXT devExt = (PGEMINI_DEVICE_EXT)ServiceContext;
ULONG StatusRegister;
ULONG ControlRegister;
BOOLEAN OurInterrupt = FALSE;
//--------------------------------------------------------------------

UNREFERENCED_PARAMETER(Interrupt);

//
// Always check the signature of the struct!
//
ASSERT(devExt->magic == GEMINI_DEVICE_EXT_MAGIC);

//
// Read out the Status and control once
//
StatusRegister = ReadStatusRegister(devExt);
ControlRegister = ReadControlRegister(devExt);

//
// Check if it’s an error interrupt first
//
if (StatusRegister & ( STATUS_BIT_MEMORY_ERROR
| STATUS_BIT_LINK_PROTO_ERR)){

devExt->InterruptReason[ISR_FOR_ERRORS].Active = TRUE;
devExt->InterruptReason[ISR_FOR_ERRORS].ShadowIsrStatusRegister = StatusRegister;
devExt->InterruptReason[ISR_FOR_ERRORS].ShadowIsrControlRegister = ControlRegister;

//
// For now, just clear the errors
//
DisableErrorInterrupts(devExt);
ClearErrors(devExt);

OurInterrupt = TRUE;
}

//
// Check if it’s our DMA interrupt
//
if (StatusRegister & STATUS_BIT_DMA_INT) {

devExt->InterruptReason[ISR_FOR_DMA].Active = TRUE;
devExt->InterruptReason[ISR_FOR_DMA].ShadowIsrStatusRegister = StatusRegister;
devExt->InterruptReason[ISR_FOR_DMA].ShadowIsrControlRegister = ControlRegister;

//
// Turn off the DMA_ENABLE bit to deassert the interrupt.
//
ClearControlRegisterBits(devExt, CTRL_BIT_ENABLE_DMA);
DisableDmaInterrupts(devExt);

OurInterrupt = TRUE;
}

//
// Check if it’s our CANbus interrupt
//
if (StatusRegister & STATUS_BIT_CAN_INT) {

devExt->InterruptReason[ISR_FOR_CANBUS].Active = TRUE;
devExt->InterruptReason[ISR_FOR_CANBUS].ShadowIsrStatusRegister = StatusRegister;
devExt->InterruptReason[ISR_FOR_CANBUS].ShadowIsrControlRegister = ControlRegister;

//
// We must disable these here too
//
DisableCanbusInterrupts(devExt);

if (devExt->FirmwareProductVersion >= 2){
//
// Debug timers
//
PDEBUG_LATENCY_TIMERS Timers = &devExt->Statistics.DebugLatencyTimers;

StopCanbusTimer(devExt);
Timers->CanbusRegister = ReadCanbusTimerReg(devExt);
WriteCanbusTimerReg(devExt, 0);

if (Timers->CanbusRegister > Timers->MaxCanbusIsrLatency){
Timers->MaxCanbusIsrLatency = Timers->CanbusRegister;
}

__int64 TimerTmp = (Timers->AverageCanbusIsrLatency * Timers->CanbusIsrCount)

  • Timers->CanbusRegister;
    Timers->CanbusIsrCount++;

Timers->AverageCanbusIsrLatency = (ULONG)(TimerTmp / Timers->CanbusIsrCount);
}

OurInterrupt = TRUE;
}

//
// If it wasn’t ours, get out!
//
if (!OurInterrupt){
devExt->Statistics.NotOurInterrupt++;
return (FALSE);
}

//
// Measure the ISR / DPC latency
//
WriteDebugTimerReg(devExt, 0);
StartDebugTimer(devExt);

//
// Ok - if we’re here it is out interrupt.
// Queue a DPCForISR
//
IoRequestDpc( devExt->FunctionalDeviceObject,
NULL,
NULL);

return (TRUE);
}

VOID DpcForIsr( PKDPC Dpc,
PDEVICE_OBJECT DeviceObject,
PIRP Unused,
PVOID Context)
{
//-------------------------------------------------------------------------------------
PGEMINI_DEVICE_EXT devExt = (PGEMINI_DEVICE_EXT) DeviceObject->DeviceExtension;
KIRQL oldIrql;
PIRP readIrp;
ULONG i;
ISR_DATA InterruptReason[NUM_INTERRUPT_REASONS];
BOOLEAN IrpIsCancelled;
NTSTATUS status = STATUS_SUCCESS;
NTSTATUS ErrorIsrStatus = STATUS_SUCCESS;
NTSTATUS DmaIsrStatus = STATUS_SUCCESS;

QUEUED_IO_COMPLETE_REQUEST QueuedIoCompleteRequests[MAX_QUEUED_IO_COMPLETE_REQUESTS];
ULONG CurrentIoCompleteQueue;
//-------------------------------------------------------------------------------------

UNREFERENCED_PARAMETER(Dpc);
UNREFERENCED_PARAMETER(Unused);
UNREFERENCED_PARAMETER(Context);

//
// By definition DPC’s are called at DISPATCH_LEVEL, so we shouldn’t
// ever not be called at DISPATCH_LEVEL
//
ASSERT(KeGetCurrentIrql() == DISPATCH_LEVEL);

//
// Always check the signature of the struct!
//
ASSERT(devExt->magic == GEMINI_DEVICE_EXT_MAGIC);

//---------------------------------------------------------------------
//
// Grab our interrupt spinlock
//
oldIrql = KeAcquireInterruptSpinLock(devExt->InterruptObject);

//
// Make a local copy of the device struct
//
RtlCopyMemory( &InterruptReason,
&devExt->InterruptReason,
NUM_INTERRUPT_REASONS * sizeof (ISR_DATA));

//
// And zero out the DevExt copy
//
RtlZeroMemory ( &devExt->InterruptReason,
NUM_INTERRUPT_REASONS * sizeof (ISR_DATA)
);

//
// Release our interrupt spinlock
//
KeReleaseInterruptSpinLock(devExt->InterruptObject, oldIrql);
//---------------------------------------------------------------------

//
// Zero our Queue of Irp Completions right away
//
RtlZeroMemory( QueuedIoCompleteRequests,
sizeof(QUEUED_IO_COMPLETE_REQUEST) * MAX_QUEUED_IO_COMPLETE_REQUESTS
);

CurrentIoCompleteQueue = 0;

//
// CANbus data or event? ---------------------------------------------
//
if (InterruptReason[ISR_FOR_CANBUS].Active) {

HandleCanbusDpc(devExt);

//
// Enable the interrupt when we are done
//
EnableCanbusInterrupts(devExt);
}

//
// Did we have any errors? --------------------------------------------
//
if (InterruptReason[ISR_FOR_ERRORS].Active) {

//
// Check these for the statistics we should be keeping
//
ErrorIsrStatus = CheckStatusErrorsForImageRead( devExt,
InterruptReason[ISR_FOR_ERRORS].ShadowIsrStatusRegister);

//
// Enable the interrupt when we are done
//
EnableErrorInterrupts(devExt);

//
// More processing here…
//

}

//
// Is an Image Read complete? ----------------------------------------
//
if (InterruptReason[ISR_FOR_DMA].Active) {

//
// More processing here…
//

}// End of - InterruptReason[ISR_FOR_DMA]

//
// More processing here…
//

return;
}

> The DPC is supposed to handle it if the interrupt was called multiple times before the DPC got to run.

Have you tried attaching the debugger to the failed machine and looking at the interrupt-related fields of the devext?

Also check that the logic of saving the interrupt state in the ISR is the logical OR and can sustain several arrived interrupts before the DPC.

Also note that, in the DPC, the interrupt state you see under the interrupt spinlock can be empty - no interrupt. In this case, the DPC should bail out just after releasing the lock.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

You enable/disable different intr sources individually but you have only *one* dpc. You could run into many problems?with this scheme.
?
Save yourself by either:

  1. one dpc, disable all interrupts in isr, enable all interrupts?at the then of dpc, OR
  2. one dpc for *each* intr you want to turn on/off separately.


Calvin Guan
Broadcom Corp.
Connecting Everything(r)

— On Sat, 2/21/09, xxxxx@hologic.com wrote:

From: xxxxx@hologic.com
Subject: RE:[ntdev] Loosing Interrupts on a PCI Express card running under XP
To: “Windows System Software Devs Interest List”
Date: Saturday, February 21, 2009, 2:15 PM

Thank you everyone for taking the time to look at this.?

First, it?s good news to hear that I will be interrupted again if the interrupt doesn?t go away.? This is what I expected and designed for.?

The DPC is supposed to handle it if the interrupt was called multiple times before the DPC got to run.? I actually have 3 sets of shadow registers, one for each interrupt reason.? And the DPC should also handle the cases when it?s run on multiple processors at the same time.? Probably the easiest thing to do now is to attach a slightly paired down version of the DCP and ISR code for everyone to look at, because I can?t find anything wrong with it so far.?

Thanks!

enum INTERRUPT_REASON
{
? ? ISR_FOR_DMA,
? ? ISR_FOR_CANBUS,
? ? ISR_FOR_ERRORS,
? ? NUM_INTERRUPT_REASONS

};

typedef struct _ISR_DATA{

? ? //
? ? //? Is this interrupt data valid? Did we get this interrupt?
? ? //
? ? BOOLEAN Active;

? ? //
? ? //? Shadow copies of the Status and Control registers from this ISR entry
? ? //
? ? ULONG? ? ? ? ? ? ???ShadowIsrStatusRegister;
? ? ULONG? ? ? ? ? ? ???ShadowIsrControlRegister;

}ISR_DATA;

typedef struct _GEMINI_DEVICE_EXT {

? ? //
? ? // …
? ? //

? ? //
? ? // Why were we interrupted?
? ? //
? ? ISR_DATA? ? ? ? ? ? InterruptReason[NUM_INTERRUPT_REASONS];

? ? //
? ? // …
? ? //

} GEMINI_DEVICE_EXT, *PGEMINI_DEVICE_EXT;

BOOLEAN HandleInterrupt (???PKINTERRUPT Interrupt,
? ? ? ? ? ? ? ? ? ? PVOID? ? ???ServiceContext)
{
? ? //--------------------------------------------------------------------
? ? PGEMINI_DEVICE_EXT? devExt? ? ???= (PGEMINI_DEVICE_EXT)ServiceContext;
? ? ULONG? ? ? ? ? ? ???StatusRegister;
? ? ULONG? ? ? ? ? ? ???ControlRegister;
? ? BOOLEAN? ? ? ? ? ???OurInterrupt = FALSE;
? ? //--------------------------------------------------------------------

? ? UNREFERENCED_PARAMETER(Interrupt);

? ? //
? ? //? Always check the signature of the struct!
? ? //
? ? ASSERT(devExt->magic == GEMINI_DEVICE_EXT_MAGIC);

? ? //
? ? //? Read out the Status and control once
? ? //
? ? StatusRegister? = ReadStatusRegister(devExt);
? ? ControlRegister = ReadControlRegister(devExt);

? ? //
? ? //? Check if it’s an error interrupt first
? ? //
? ? if (StatusRegister & (? ? ? STATUS_BIT_MEMORY_ERROR
? ? ? ? ? ? ? ? ? ? ? ? ? ? |???STATUS_BIT_LINK_PROTO_ERR)){

? ? ? ? devExt->InterruptReason[ISR_FOR_ERRORS].Active = TRUE;
? ? ? ? devExt->InterruptReason[ISR_FOR_ERRORS].ShadowIsrStatusRegister? = StatusRegister;
? ? ? ? devExt->InterruptReason[ISR_FOR_ERRORS].ShadowIsrControlRegister = ControlRegister;

? ? ? ? //
? ? ? ? //? For now, just clear the errors
? ? ? ? //
? ? ? ? DisableErrorInterrupts(devExt);
? ? ? ? ClearErrors(devExt);

? ? ? ? OurInterrupt = TRUE;
? ? }

? ? //
? ? //? Check if it’s our DMA interrupt
? ? //
? ? if (StatusRegister & STATUS_BIT_DMA_INT) {

? ? ? ? devExt->InterruptReason[ISR_FOR_DMA].Active = TRUE;
? ? ? ? devExt->InterruptReason[ISR_FOR_DMA].ShadowIsrStatusRegister? = StatusRegister;
? ? ? ? devExt->InterruptReason[ISR_FOR_DMA].ShadowIsrControlRegister = ControlRegister;

? ? ? ? //
? ? ? ? //? Turn off the DMA_ENABLE bit to deassert the interrupt.
? ? ? ? //
? ? ? ? ClearControlRegisterBits(devExt, CTRL_BIT_ENABLE_DMA);
? ? ? ? DisableDmaInterrupts(devExt);

? ? ? ? OurInterrupt = TRUE;
? ? }

? ? //
? ? //? Check if it’s our CANbus interrupt
? ? //
? ? if (StatusRegister & STATUS_BIT_CAN_INT) {

? ? ? ? devExt->InterruptReason[ISR_FOR_CANBUS].Active = TRUE;
? ? ? ? devExt->InterruptReason[ISR_FOR_CANBUS].ShadowIsrStatusRegister? = StatusRegister;
? ? ? ? devExt->InterruptReason[ISR_FOR_CANBUS].ShadowIsrControlRegister = ControlRegister;

? ? ? ? //
? ? ? ? //? We must disable these here too
? ? ? ? //
? ? ? ? DisableCanbusInterrupts(devExt);

? ? ? ? if (devExt->FirmwareProductVersion >= 2){
? ? ? ? ? ? //
? ? ? ? ? ? //? Debug timers
? ? ? ? ? ? //
? ? ? ? ? ? PDEBUG_LATENCY_TIMERS Timers = &devExt->Statistics.DebugLatencyTimers;

? ? ? ? ? ? StopCanbusTimer(devExt);
? ? ? ? ? ? Timers->CanbusRegister = ReadCanbusTimerReg(devExt);
? ? ? ? ? ? WriteCanbusTimerReg(devExt, 0);

? ? ? ? ? ? if (Timers->CanbusRegister > Timers->MaxCanbusIsrLatency){
? ? ? ? ? ? ? ? Timers->MaxCanbusIsrLatency = Timers->CanbusRegister;
? ? ? ? ? ? }

? ? ? ? ? ? __int64 TimerTmp =? ? (Timers->AverageCanbusIsrLatency * Timers->CanbusIsrCount)
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? + Timers->CanbusRegister;
? ? ? ? ? ? Timers->CanbusIsrCount++;

? ? ? ? ? ? Timers->AverageCanbusIsrLatency = (ULONG)(TimerTmp / Timers->CanbusIsrCount);
? ? ? ? }

? ? ? ? OurInterrupt = TRUE;
? ? }

? ? //
? ? //? If it wasn’t ours, get out!
? ? //
? ? if (!OurInterrupt){
? ? ? ? devExt->Statistics.NotOurInterrupt++;
? ? ? ? return (FALSE);
? ? }

? ? //
? ? //? Measure the ISR / DPC latency
? ? //
? ? WriteDebugTimerReg(devExt, 0);
? ? StartDebugTimer(devExt);

? ? //
? ? // Ok - if we’re here it is out interrupt.
? ? // Queue a DPCForISR
? ? //
? ? IoRequestDpc(???devExt->FunctionalDeviceObject,
? ? ? ? ? ? ? ? ? ? NULL,
? ? ? ? ? ? ? ? ? ? NULL);

? ? return (TRUE);
}

VOID DpcForIsr(? PKDPC? ? ? ? ???Dpc,
? ? ? ? ? ? ? ???PDEVICE_OBJECT? DeviceObject,
? ? ? ? ? ? ? ???PIRP? ? ? ? ? ? Unused,
? ? ? ? ? ? ? ???PVOID? ? ? ? ???Context)
{
? ? //-------------------------------------------------------------------------------------
? ? PGEMINI_DEVICE_EXT? devExt = (PGEMINI_DEVICE_EXT) DeviceObject->DeviceExtension;
? ? KIRQL? ? ? ? ? ? ???oldIrql;
? ? PIRP? ? ? ? ? ? ? ? readIrp;
? ? ULONG? ? ? ? ? ? ???i;
? ? ISR_DATA? ? ? ? ? ? InterruptReason[NUM_INTERRUPT_REASONS];
? ? BOOLEAN? ? ? ? ? ???IrpIsCancelled;
? ? NTSTATUS? ? ? ? ? ? status? ? ? ? ? = STATUS_SUCCESS;
? ? NTSTATUS? ? ? ? ? ? ErrorIsrStatus? = STATUS_SUCCESS;
? ? NTSTATUS? ? ? ? ? ? DmaIsrStatus? ? = STATUS_SUCCESS;

? ? QUEUED_IO_COMPLETE_REQUEST? QueuedIoCompleteRequests[MAX_QUEUED_IO_COMPLETE_REQUESTS];
? ? ULONG? ? ? ? ? ? ? ? ? ? ???CurrentIoCompleteQueue;
? ? //-------------------------------------------------------------------------------------

? ? UNREFERENCED_PARAMETER(Dpc);
? ? UNREFERENCED_PARAMETER(Unused);
? ? UNREFERENCED_PARAMETER(Context);

? ? //
? ? //? By definition DPC’s are called at DISPATCH_LEVEL, so we shouldn’t
? ? //? ever not be called at DISPATCH_LEVEL
? ? //
? ? ASSERT(KeGetCurrentIrql() == DISPATCH_LEVEL);

? ? //
? ? //? Always check the signature of the struct!
? ? //
? ? ASSERT(devExt->magic == GEMINI_DEVICE_EXT_MAGIC);

? ? //---------------------------------------------------------------------
? ? //
? ? //? Grab our interrupt spinlock
? ? //
? ? oldIrql = KeAcquireInterruptSpinLock(devExt->InterruptObject);

? ? //
? ? //? Make a local copy of the device struct
? ? //
? ? RtlCopyMemory(? &InterruptReason,
? ? ? ? ? ? ? ? ? ? &devExt->InterruptReason,
? ? ? ? ? ? ? ? ? ? NUM_INTERRUPT_REASONS * sizeof (ISR_DATA));

? ? //
? ? //? And zero out the DevExt copy
? ? //
? ? RtlZeroMemory ( &devExt->InterruptReason,
? ? ? ? ? ? ? ? ? ? NUM_INTERRUPT_REASONS * sizeof (ISR_DATA)
? ? ? ? ? ? ? ? ? ? );

? ? //
? ? //? Release our interrupt spinlock
? ? //
? ? KeReleaseInterruptSpinLock(devExt->InterruptObject, oldIrql);
? ? //---------------------------------------------------------------------

? ? //
? ? //? Zero our Queue of Irp Completions right away
? ? //
? ? RtlZeroMemory(? QueuedIoCompleteRequests,
? ? ? ? ? ? ? ? ? ? sizeof(QUEUED_IO_COMPLETE_REQUEST) * MAX_QUEUED_IO_COMPLETE_REQUESTS
? ? ? ? ? ? ? ? ? ? );

? ? CurrentIoCompleteQueue = 0;

? ? //
? ? //? CANbus data or event? ---------------------------------------------
? ? //
? ? if (InterruptReason[ISR_FOR_CANBUS].Active) {

? ? ? ? HandleCanbusDpc(devExt);

? ? ? ? //
? ? ? ? //? Enable the interrupt when we are done
? ? ? ? //
? ? ? ? EnableCanbusInterrupts(devExt);
? ? }

? ? //
? ? //? Did we have any errors? --------------------------------------------
? ? //
? ? if (InterruptReason[ISR_FOR_ERRORS].Active) {

? ? ? ? //
? ? ? ? //? Check these for the statistics we should be keeping
? ? ? ? //
? ? ? ? ErrorIsrStatus = CheckStatusErrorsForImageRead( devExt,
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? InterruptReason[ISR_FOR_ERRORS].ShadowIsrStatusRegister);

? ? ? ? //
? ? ? ? //? Enable the interrupt when we are done
? ? ? ? //
? ? ? ? EnableErrorInterrupts(devExt);

? ? ? ? //
? ? ? ? //? More processing here…
? ? ? ? //

? ? }

? ? //
? ? //? Is an Image Read complete? ----------------------------------------
? ? //
? ? if (InterruptReason[ISR_FOR_DMA].Active) {

? ? ? ? //
? ? ? ? //? More processing here…
? ? ? ? //

? ? }// End of - InterruptReason[ISR_FOR_DMA]

? ? //
? ? //? More processing here…
? ? //

? ? return;
}


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

If another interrupt occurs after DPC got dequeued but before DPC routine acquires interrupt spinlock, it may well happen that DPC that got queued upon the second interrupt occurence does not have to do any processing, unless yet another interrupt had occurred after lines got re-enabled but before DPC got dequeued. Your code does not seem to take it into consideration. I suspect this is where problem may be rooted…

In general, I think you should, probably, replace a DPC for ISR with custom DPCs , i.e. one DPC per interrupt reason. Otherwise, you will have to work quite a lot on synchronization that becomes quite tricky issue with quite a few “small nasty details”…

Anton Bassov

>

Thank you everyone for taking the time to look at this.

First, it?s good news to hear that I will be interrupted again if the
interrupt doesn?t go away. This is what I expected and designed for.

The DPC is supposed to handle it if the interrupt was called multiple
times before the DPC got to run. I actually have 3 sets of shadow
registers, one for each interrupt reason. And the DPC should also
handle
the cases when it?s run on multiple processors at the same time.
Probably
the easiest thing to do now is to attach a slightly paired down
version of
the DCP and ISR code for everyone to look at, because I can?t find
anything wrong with it so far.

In the Dpc, you don’t call EnableXxxInterrupt with the spinlock held.
Presumably this writes to the Control Register… is it atomic (eg
InterlockedXxx)? Can you post the DisableXxxInterrupt and
EnableXxxInterrupt routines?

You don’t EnableDmaInterrupt in your DPC… is that just an omission due
to the cut down version of the code you have provided?

Why cache the Control Register at all?

James

Let me try and answer all the questions posed so far.

First off, I really like the idea of queuing a DPC for each interrupt reason. I think this could simplify things a lot. And once I looked at the code for IoRequestDpc(), it seems easy enough to do. But before I do this, if it?s possible I?d like to understand where the logic error is first, if there is one, so I don?t make the same mistake again.

I am working on a test version of the driver where I am going to pull out more of the interrupt related fields of the devExt structure, along with a few more statistics, since I can?t attach a debugger to some of the machines this occurs on.

The logic behind the DPC was to make a local copy of the InterruptReason structure array while holding the ISR spinlock, zero out the devExt copy, and then drive all the DPC Logic from the local copy. If the ISR is called twice before the DPC is dequeued, the DPC should see both structure elements as Active and handle them in one pass. If a DMA interrupt occurs, the DPC starts to handle it, and then a CANbus interrupt occurs, the DPC should run twice in a row, or even at the same time on a multiprocessor system. If the DPC was queued and it had nothing to do, all 3 if statements should evaluate to false, and it should just fall through. The ?do more processing? at the end of the DPC is actually some clean-up code if one of the ?if? statements executed and had errors, so nothing should run there either in this case. At least, this is what I was going for. But it wouldn?t be the first time I?ve messed something up.

The DisableXxxInterrupt and EnableXxxInterrupt functions are indeed wrappers to flip control register bits. The HW engineers used an interesting technique to guarantee bit setting and clearing is atomic, to set a bit, the high order bit must also be set. To clear a bit, the high order bit must be cleared. Bits that are 0 do nothing when the high order bit is set. Bits that are 1 do nothing when the high order bit is cleared. I think as long as writes to the PCI bus are not interleaved, I?m ok (I hope).

The copies of the Control register are a leftover from when the HW engineer wasn?t sure if bits in there could change by themselves. (Something else to cut out next release I think).

The logic to re-enable DMA was indeed cut out, because sometimes I don?t re-enable it if the transfer is complete, but yes, it lives inside the DMA if statement.

Here?s the code for the Disable / Enable calls:

VOID
DisableCanbusInterrupts(PGEMINI_DEVICE_EXT DevExt)
{
ClearControlRegisterBits(DevExt, CTRL_BIT_CAN_INTERRUPT);
}

VOID
DisableDmaInterrupts(PGEMINI_DEVICE_EXT DevExt)
{
ClearControlRegisterBits(DevExt, CTRL_BIT_DMA_INTERRUPT);
}

#define CTRL_MASK_SET 0x80000000
#define CTRL_MASK_CLEAR (~CTRL_MASK_SET)

VOID
ClearControlRegisterBits( PGEMINI_DEVICE_EXT DevExt,
ULONG Data)
{
//-----------------
ULONG reg;
//-----------------

reg = Data & CTRL_MASK_CLEAR;

WriteRegisterULongBar0(DevExt, OFFSET_CONTROL_REG, reg);
return;
}

VOID
WriteRegisterULongBar0( PGEMINI_DEVICE_EXT DevExt,
ULONG Register,
ULONG Data)
{
WRITE_REGISTER_ULONG((PULONG)((ULONG_PTR)DevExt->BAR0 + Register), Data);

return;
}

Wait, I made a mistake regarding how the control register works.

It should be “to set a bit, the high order bit must also be set. To clear a bit, the high order bit must be cleared.
To clear a bit, set bit 31 = 0, and the bit(s) you want cleared to 1. To set a bit, set bit 31 = 1 and the bit(s) you want set to 1.”

The code should be right even if my explanation is lacking.

>

The logic behind the DPC was to make a local copy of the
InterruptReason
structure array while holding the ISR spinlock, zero out the devExt
copy,
and then drive all the DPC Logic from the local copy. If the ISR is
called twice before the DPC is dequeued, the DPC should see both
structure
elements as Active and handle them in one pass. If a DMA interrupt
occurs, the DPC starts to handle it, and then a CANbus interrupt
occurs,
the DPC should run twice in a row, or even at the same time on a
multiprocessor system. If the DPC was queued and it had nothing to
do,
all 3 if statements should evaluate to false, and it should just fall
through. The ?do more processing? at the end of the DPC is actually
some
clean-up code if one of the ?if? statements executed and had errors,
so
nothing should run there either in this case. At least, this is what
I
was going for. But it wouldn?t be the first time I?ve messed
something
up.

The DPC will run twice in a row if it is already running when you queue
it again, but if the IP (instruction pointer - sorry you probably
already knew that :slight_smile: is between the start of the Dpc and the
KeAcquireInterruptSpinLock when an interrupt occurs, then isn’t the
first copy of the devExt->InterruptReason lost?

The first DPC call will get the InterruptReason as at the second ISR
because the DPC hadn’t gotten that far yet, and the second DPC call will
get an empty InterruptReason because the first DPC cleared it.

James

> if the IP (instruction pointer - sorry you probably already knew that :slight_smile: is between the start of the Dpc

and the KeAcquireInterruptSpinLock when an interrupt occurs, then isn’t the first
copy of the devExt->InterruptReason lost?

Of course not - the original DPC will handle both cases in one go, but the second DPC will be no-op unless
yet another interrupt occurs while the second DPC is still in a queue…

Anton Bassov

Hi All,

Some additional information here. I have gotten the failure to happen on a development machine running a debug version of the driver. I added counters to the ISR and DPC for each reason that the ISR or DPC could be running, in addition to bringing out the InterruptReason array from the devExt.

The data at the time of failure looks as follows (these are just counters):
Error ISR: 3
Dma ISR: 9998
Can ISR: 304900
Error DPC: 3
Dma DPC: 9998
Can DPC: 304900

The InterruptReason is all zeros, no interrupts pending from the ISR for the DPC.

(The 3 error ints and Dpcs are probably red herrings, this is just saying that I set up the test generator DMA incorrectly.)

The Status register says that internal CANbus and DMA interrupt lines are active. The Enable bits for each of these in the Control register are also active. The PC is still running, mouse, keyboard, and network all work.

It looks like someone (Windows, the PCI / PCI Express bus, or the PCI Express card) just dropped an interrupt and never called the driver’s ISR. The driver’s interrupts and DPCs are lock in step with each other, the hardware says it should be interrupting, but we are not seeing our interrupt. Which brings me right back to what Peter was first saying about dropping an edge triggered interrupt somehow. But if Windows guarantees that it will continue to run and re-run the interrupt chain until the interrupt is cleared, and PCI express does emulate level triggered interrupts, then I am leaning towards our PCI Express card as the cause.

Best Regards,
-Mike

Change your interrupt service routine to check if ANY interrupting bits
are set and if so, save the interrupt status in the devExt, disable ALL
interrupts, and request the DPC. Forget this INTERRUPT_REASON stuff -
find the reason from the saved status register in the DPC and handle it
there.

if (StatusRegister & (STATUS_BIT_MEMORY_ERROR |STATUS_BIT_LINK_PROTO_ERR |
STATUS_BIT_DMA_INT | STATUS_BIT_CAN_INT))
{
devExt->StatusRegister = StatusRegister;
DisableALLInterrupts (devExt);
IoRequestDpc (devExt->FunctionalDeviceObject,
NULL,
NULL);
return (TRUE);
}

return (FALSE);

You’ll have a lot less trouble! No interrupts during the DPC, no need for
multiple DPCs, no muss, no fuss.

xxxxx@hologic.com
Sent by: xxxxx@lists.osr.com
02/22/2009 10:58 PM
Please respond to
“Windows System Software Devs Interest List”

To
“Windows System Software Devs Interest List”
cc

Subject
RE:[ntdev] Loosing Interrupts on a PCI Express card running under XP

Hi All,

Some additional information here. I have gotten the failure to happen on
a development machine running a debug version of the driver. I added
counters to the ISR and DPC for each reason that the ISR or DPC could be
running, in addition to bringing out the InterruptReason array from the
devExt.

The data at the time of failure looks as follows (these are just
counters):
Error ISR: 3
Dma ISR: 9998
Can ISR: 304900
Error DPC: 3
Dma DPC: 9998
Can DPC: 304900

The InterruptReason is all zeros, no interrupts pending from the ISR for
the DPC.

(The 3 error ints and Dpcs are probably red herrings, this is just saying
that I set up the test generator DMA incorrectly.)

The Status register says that internal CANbus and DMA interrupt lines are
active. The Enable bits for each of these in the Control register are
also active. The PC is still running, mouse, keyboard, and network all
work.

It looks like someone (Windows, the PCI / PCI Express bus, or the PCI
Express card) just dropped an interrupt and never called the driver’s ISR.
The driver’s interrupts and DPCs are lock in step with each other, the
hardware says it should be interrupting, but we are not seeing our
interrupt. Which brings me right back to what Peter was first saying
about dropping an edge triggered interrupt somehow. But if Windows
guarantees that it will continue to run and re-run the interrupt chain
until the interrupt is cleared, and PCI express does emulate level
triggered interrupts, then I am leaning towards our PCI Express card as
the cause.

Best Regards,
-Mike


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

> I added counters to the ISR and DPC for each reason that the ISR or DPC could be running

You did not have to - it is pretty obvious that, in numeral terms, you would get a perfect match between ISRs and DPCs as you did. I think the problem lies with some logical mismatch when it comes to re-enabling a line in DPC. The way you put it yourself, “sometimes I don?t re-enable it if the transfer is complete”. I think this is where your problem is rooted. Furthermore, I have a weird feeling that running no-op DPC makes some contribution to this logical mismatch, although I cannot tell you how exactly it happens unless you show us absolutely all your code…

Anton Bassov