Strange interrupt behavior

Hello,

The company I work for designs and manufactures various types of I/O cards.

A customer is currently seeing some strange interrupt behavior with one of our PCIe digital I/O boards. This board and the WDM driver (written by a consultant) have been around for years. We can’t reproduce the problem and no other customers have reported it. My knowledge of kernel drivers and kernel debugging is limited.

I’m hoping one of you may have some ideas on what might be going on.

The customer runs a console application which writes a few registers to configure change-of-state interrupts. He then changes the voltage on the input channel. He should get an interrupt at this point, but he doesn’t. If he uses the console app to read any register on the board, then he will see the interrupt.

We had the customer repeat the test using a debug build of the driver with DebugView. No ISR trace statements were seen when the channel was toggled. A subsequent register read was required to see them.

Some details:
Windows 10 Enterprise LTSC
Rackmount computer with Xeon W-2123 CPU
Board is using MSI
The board’s registers are mapped into user space. The console application reads and writes the board through a user space pointer.

Ideally, we would be able to get our hands on their system and hook it up to a bus analyzer. However, that doesn’t seem likely.

I realize this isn’t a lot to go on but any ideas?

Thanks.

Yeah… wow.

Hmmmm… You’ve tried swapping the card they’re using, I assume (to rule-out some weird problem with the hardware)?

Could it be a SPEED thing? Maybe the customer’s CPU is writing the registers “really fast” and the changes aren’t being “seen” by the hardware? Has he tried inserting arbitrarily long waits between the writes?

Could it be a pipelining/fence/barrier issue? He should be calling FastFence() after his register write, and he should be calling _ReadWriteBarrier() before his register reads.

Sure… these are desperate measures. But, I sorta felt like I should suggest SOMEthing, so… :wink:

Peter

Just a guess, but perhaps whatever the customer has done while “write(ing) a few registers to configure change-of-state interrupts” has put the hardware in a state that requires a read operation to clear it? It wouldn’t be the first hardware device with this sort of ‘feature’.

In an earlier version of a driver I’m maintaining I had a similar problem. The driver had worked for many years on many systems, but started to fail on a couple of new systems with new CPU types or chipsets.

The reason was that my original driver code accessed memory mapped registers just using the memory addresses as pointers, i.e.

*cmd_reg = cmd;
val = *data_reg;

Obviously some re-ordering occurred with new systems that caused the driver to fail.

Using predefined functions like READ_REGISTER_ULONG() and WRITE_REGISTER_ULONG() instead of direct pointer dereferencing fixed the problem because AFAIK these low level functions use some barrier instructions to avoid re-ordering of register accesses.
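
For illustration, here is a user-mode sketch of the same idea. It assumes the x64 behavior mentioned elsewhere in this thread (a store fence after the write, a compiler barrier before the read). The helper names reg_write32 and reg_read32 are hypothetical, and the non-MSVC branch uses C11 fences as stand-ins so the sketch compiles with other toolchains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#define COMPILER_BARRIER() _ReadWriteBarrier()
#define STORE_FENCE()      __faststorefence()   /* x64-only intrinsic */
#else
#include <stdatomic.h>
/* Assumption: C11 fences used as portable stand-ins for the intrinsics. */
#define COMPILER_BARRIER() atomic_signal_fence(memory_order_seq_cst)
#define STORE_FENCE()      atomic_thread_fence(memory_order_seq_cst)
#endif

/* Hypothetical helper: store to a mapped register, then fence. */
static void reg_write32(volatile uint32_t *reg, uint32_t value)
{
    assert(reg != NULL);
    *reg = value;
    STORE_FENCE();
}

/* Hypothetical helper: compiler barrier, then load from a mapped register. */
static uint32_t reg_read32(volatile uint32_t *reg)
{
    assert(reg != NULL);
    COMPILER_BARRIER();
    return *reg;
}
```

With helpers like these, the two lines above would become reg_write32(cmd_reg, cmd); followed by val = reg_read32(data_reg);.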

I suspect Martin is correct.

Have you changed from PCI to PCIe over time? Has it always been MSI?

If you read the documentation on WRITE_REGISTER_ULONG() https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-write_register_ulong the note that it inserts a memory barrier is important.

PCI ordering is a thing. Some host bridges will do write combining, which I’ve heard some people describe as not a “host bridge”, but a “host disaster”.
My suspicion is that the write isn’t making its way to the card due to write combining. The write actually doesn’t arrive until the subsequent read is issued. A bus analyzer would really help. Actually owning the driver would also help.
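
If the write really is being held somewhere until the next read, one common workaround is to follow each configuration write with a read from the same device, since a read is not allowed to pass earlier posted writes on PCI/PCIe. A minimal sketch, with a hypothetical helper name and assuming the register chosen for the read-back has no read side effects:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: write one register, then read a register on the
 * same device so that the earlier write has to reach the device before
 * the read can complete. The value read back is returned to the caller.
 */
static uint32_t reg_write32_flush(volatile uint32_t *reg,
                                  uint32_t value,
                                  volatile uint32_t *readback_reg)
{
    assert(reg != NULL && readback_reg != NULL);
    *reg = value;          /* posted write toward the device */
    return *readback_reg;  /* non-posted read forces the write out first */
}
```

If the customer's problem disappears with this in place, that would support the delayed-write theory without needing a bus analyzer.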

Yes… I had similar issues getting some “quick prototyping code” to work properly. Which is why I suggested calling FastFence() after register writes, and _ReadWriteBarrier() before register reads. This is what WRITE_REGISTER_xxx and READ_REGISTER_xxx do, respectively.

:slight_smile:

There are some things that we can say for sure are not the problem. The version of Windows is irrelevant and so is the driver code. The wisdom of this design depends on your use case, but based on what you have, the problem is either in your UM console program code or in your hardware. There is a tiny chance that there is some problem with the chipset on this particular motherboard, but I would discount that as highly unlikely. If there were a significant problem with the chipset, it would probably render the machine unbootable.

From your description, I assume that the speed of accessing the hardware is slow, as in human speed with several seconds between steps. One thing that is not clear is whether the interrupt that eventually arrives is the one that should have come earlier, or a new interrupt triggered after the read.

I assume also that your console program is single threaded? And presumably, the device memory is mapped as uncached without write combining?

Xeon W-2123 is a Skylake processor that does not support multiple CPUs per system. With this configuration, memory barriers, or the lack thereof, seem an unlikely cause since there will be only one memory controller and no inter-chip coherency.

Compile-time reordering could be an issue, but then you would expect to see the same kinds of problems for anyone using the same binary version of that console program. It seems safe to discount that too.

The most likely problem is a logic issue in your hardware, but with only this much to go on, it will be very hard to proceed.

Thanks for the ideas everyone.

We do have the source code for the driver. It calls MmMapLockedPagesSpecifyCache with MmNonCached to map the board’s memory. The driver doesn’t do much. Most of the device logic is in a DLL which accesses the board using a volatile pointer to the mapped memory. The console app I mentioned is just a demo we provide showing how to use the DLL functions.

The only place the driver reads or writes the board is in the interrupt handler. This handler just verifies the board is interrupting, clears the interrupt enable bit and signals the DLL to finish the processing. The DLL creates a separate thread to listen for interrupts. The driver only uses WRITE_REGISTER_xxx and READ_REGISTER_xxx functions to access the hardware.

I’ll try adding some calls to _ReadWriteBarrier() and __faststorefence() to the user space code and have the customer try it.

Thanks again.

@AIWM

This handler just verifies the board is interrupting, clears the interrupt enable bit and signals the DLL
Who sets the interrupt enable bit and when?

If he uses the console app to read any register on board, then he will see the interrupt.
Does this mean that reading any register from the console app causes the interrupt to arrive all the way to the user DLL?
(this would hint at a write buffer not being flushed)
Or that a read by the console app sees the interrupt request bit (?) active on the card, but the interrupt does not occur?

Who sets the interrupt enable bit and when?
When the board is initially “opened” by the console app, the Board Interrupt Enable bit is not set. The app has a Configuration menu that reads and displays the interrupt related register contents and allows the user to change them. This is where Board Interrupt Enable bit initially gets set. After any change, the menu reads and displays the registers again. When an interrupt occurs, the kernel driver clears this bit and signals the DLL. The DLL’s ISR services the individual channel conditions and then turns the Board Interrupt Enable bit back on. Then the DLL invokes a callback function in the console app. For debugging purposes, the callback prints that an interrupt was received.
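
The sequence described above can be modeled roughly as follows. This is a sketch in plain C, not the actual driver or DLL code; the struct, the bit position and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BOARD_IRQ_ENABLE 0x1u   /* hypothetical bit position */

/* Hypothetical model of the board state and the app-visible result. */
typedef struct {
    uint32_t irq_ctrl;    /* holds the Board Interrupt Enable bit */
    uint32_t cos_status;  /* per-channel change-of-state flags */
    int      callbacks;   /* how many times the app callback has run */
} board_model;

/* Kernel ISR step: verify the board is interrupting, mask it, signal the DLL. */
static bool kernel_isr(board_model *b)
{
    if (!(b->irq_ctrl & BOARD_IRQ_ENABLE) || b->cos_status == 0u) {
        return false;                   /* not our interrupt */
    }
    b->irq_ctrl &= ~BOARD_IRQ_ENABLE;   /* clear Board Interrupt Enable */
    return true;                        /* stands in for signalling the DLL */
}

/* DLL step: service the channels, re-enable interrupts, invoke the callback. */
static void dll_service(board_model *b)
{
    b->cos_status = 0u;                 /* service the channel conditions */
    b->irq_ctrl |= BOARD_IRQ_ENABLE;    /* turn the enable bit back on */
    b->callbacks++;                     /* stands in for the console callback */
}
```

In this model the enable bit is off between the kernel ISR and the end of the DLL's processing, so that is the window in which the board cannot raise another interrupt.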

Does this mean that reading any register from the console app causes the interrupt to arrive all the way to the user DLL?
When the customer changes the channel voltage he doesn’t see the callback print statement as expected. If he then accesses the console app’s Configuration or Status menus (which automatically read some registers) he sees the callback statement printed. I’m guessing reading any register will do the trick but I’ve asked the customer to verify that.

Thank you for the more detailed description.
Now it looks like a user-mode software issue, not a kernel-side issue. Maybe an exception in the “DLL’s ISR” is being handled silently.
Can the customer test on a clean Windows, with no antiviruses and other junk?

Can the customer test on a clean Windows, with no antiviruses and other junk?
That’s a good point. I need to find out if they have other software running during their tests.