I am converting a working DOS driver that services interrupts every 100us. I have the code functioning as a Windows driver, but it is way too slow.
I'm not currently doing anything with user mode. I am just catching an interrupt, setting up for the next interrupt, and returning TRUE (interrupt was serviced).
Timing (measured with KeQueryPerformanceCounter at 10 MHz, i.e. 100ns granularity) shows I get in and out of the ISR in 9.5us. There is no DPC. The timing events are logged to an array and then analyzed after 5,000 interrupts. I measured that reading the timer itself takes ~150ns, so I don't expect it to affect my measurements.
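For reference, the post-hoc analysis described above can be sketched as plain C. Names and the analysis helpers here are hypothetical, assuming a 10 MHz QPC frequency (100ns per tick) as stated:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical post-processing of an array of KeQueryPerformanceCounter
 * samples taken at ISR entry.  With a 10 MHz counter, one tick = 100 ns. */
#define QPC_FREQ_HZ 10000000ULL

/* Convert a gap between consecutive timestamps to microseconds. */
static double ticks_to_us(uint64_t ticks)
{
    return (double)ticks * 1e6 / (double)QPC_FREQ_HZ;
}

/* Mean inter-interrupt interval in microseconds over n samples. */
static double mean_interval_us(const uint64_t *ts, size_t n)
{
    assert(n >= 2);
    /* Total elapsed time divided by the number of gaps. */
    return ticks_to_us(ts[n - 1] - ts[0]) / (double)(n - 1);
}
```

Interrupts arriving every 100us should average ~100.0 here; interrupts spaced 1.6ms apart would show as ~1600.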
Timing the entries to the ISR defined by IoConnectInterrupt() shows I am getting an interrupt approximately every 1.6ms rather than at the 100us rate I am expecting them. This is very consistent and the delays are not happening in bursts. I've tried different CPUs with very different clock rates, and the 1.6ms rate remains consistent.
My driver is sending and receiving a small packet of data by DMA. Received DMA causes the interrupt. I based the design on a WDF DMA sample. The data being transferred is all in the extension and is not using user buffers, so I removed the buffer allocation and queuing that came in the sample. The data is arriving and looks correct.
Is there something I need to do to tickle the operating system to make these interrupts arrive faster?
I'm not clear from that explanation... what TRIGGERS the interrupt? Tell us more about your device... a PCIe device, I would suppose. Tell us some more about it.
Thank you. It is a PCI device. Specifically a PLX9054 interface. It is a bus master and is programmed to trigger an interrupt on receive DMA complete.
Mmmm…. A venerable device, the 9054.
And it does these DMA transfers every 100us? That’s a LOT of interrupts…. ten thousand per second (fixed my math from my original post). That’s a higher rate than I’ve encountered. I wouldn’t count on the system being able to keep up with that.
Edited to fix my dumb math error
Well, the first answer is that Windows isn't DOS, and you are unlikely to be happy with the timing you get if you actually need consistent sub-millisecond timings. In my experience, this was a bigger issue back in the XP days when many devices were being converted. Devices like piezoelectric strain gauges and medical sensors could no longer be controlled directly by the host OS, but instead had to be controlled by a microcontroller that would buffer and feed data to the host OS.
It might be possible to help more, but to do that we will need to know more about what your device does and what you are expecting the driver to do with it.
(I apologize for the math error in my previous post… now fixed)
What kind of host system do you have? Is the PCI interrupt shared with something? Are there other available PCI slots to try with this device?
Thanks for the math correction. 100us is a 10 kHz rate as you state (not 10 MHz as I originally said), and yes, that is pretty fast.
I don't think anything is sharing the interrupt. I tried making it exclusive as a test and nothing complained. In DOS I was able to work out the interrupt on the APIC that the PCI is routed to, and Windows seems to end up with that same number. The conventional (8259) interrupt is shared by various things, but I don't think that matters in Windows; in DOS I needed to use an extender to get to the APIC.
Understood that Windows isn't DOS and has things like multiple cores servicing interrupts to deal with.
It doesn't need to be extremely consistent, but the ISR cannot be blocked for more than 100us (2 unserviced interrupts in a row is a problem).
The host system is an industrial PCI (not PCIe) processor board in a mostly passive motherboard. The latest version supports an i7 LGA1151 processor. The ones we've been using have 4 cores, but there are up to 12 on an i7 (I think half of which are physical cores that can service an interrupt).
I'm fairly certain I'm botching something as it is only able to interrupt every 1.6ms, which I believe (since I botched my math badly - I double checked this time) is a 625 Hz rate. My suspicion (I'm probably just displaying my ignorance here) is that it has something to do with no DPC; that the DPC completing causes ISR rescheduling as well as rescheduling for user mode tasks, but as I won't be communicating each interrupt completion to a user there isn't much of a point to a DPC or any object for a DPC to complete against.
Well, I can set your mind at ease: It doesn’t have anything to do with having or not having a DPC. I promise you. So rest your mind that’s not the issue.
I suspect something much more basic… like your source not actually generating interrupts at the rate you expect. Can you verify that?
We need to figure out (a) the actual rate at which the source is generating interrupts, and then (b) the cause of the latency.
Thank you Peter. It saves some work to not play around with DPC as a black box.
I can verify the hardware works by running the DOS version that I can demonstrate is running at full speed. There is a hardware watchdog that detects missed and slow interrupts, and they aren't even slow running on an Atom CPU. I am not seeing a difference in timing or behavior between an Atom CPU and an i3, so it does not seem like a problem with CPU capability, with the possible exception of the number of cores (4 in both processors I've tested, which I'm assuming means 2 physical cores.)
The mechanism is command-response so that a DMA from the CPU will enable a response DMA from the board at the next 100us period. The DMA originating from the CPU is started as part of the ISR, and once in response to IRP_MN_START_DEVICE.
Why have you created a WDM driver? That’s not a good decision.
So, hmm…. I don’t know what to suggest. I feel like I must be missing something about your device. Or that *you* must be missing something in how you’ve got the device programmed, or connected to interrupts, or (as somebody else suggested) your device is sharing a very popular interrupt, or something.
Have you put a logic analyzer on the interrupt line to see how things look?
In your position there are several things I would try to do (some may simply not be possible) ...
First, see if there's a GPIO pin you can access from the driver on the hardware; I typically have at least one wired into a BAR register for just these kinds of problems ... if you can do that then debugging timing issues becomes easier; when the interrupt is triggered have the firmware raise the GPIO, when the driver gets the interrupt it lowers the GPIO, hook up the logic analyzer and you've got your timing info ...
Next, realize that the PLX9054 uses shared, level-triggered interrupts, and so might other cards in the system ... maybe even at the same time [ http://www.hollistech.com/Resources/Misc articles/pciInterrutps.htm ]. I would check to see if there is anything else in the system that might be attempting to share that IRQ.
Finally, make sure that you're clearing the interrupt flag as soon as you validate that the interrupt is yours, and that you're not disabling interrupts while you're processing the DMA (there are other, better ways to do flow control than dropping the enable/disable hammer).
I thought what I created was a combination of WDM and KMDF. The reason I created a WDM driver was because I went looking for sample PLX code and that was the sample Microsoft provided. The PLX sample code similarly uses WDM. I can see my needs are much simpler than what is provided in the WDM framework since I ended up removing a great deal. Simple answer: ignorance.
Now that I have most of the pieces of logic we need in place, I don't mind restarting from a different model and porting those pieces over.
I'm fairly certain our CPU cards do have GPIO output; we haven't been using it. I'll look into that, but it sounds like I have more fundamental issues (like the wrong driver type for the purpose).
The PLX samples, unfortunately, are somewhat older than most of the participants on this list ... it would be nice if MS deigned to modernize them, bring them into this century or even just provide another more modern sample somewhere ...
There are several more modern PCIe DMA driver samples out there, such as this one [ https://github.com/usnistgov/RIFFA_Driver ], which will probably give a better starting point than the PLX9054 sample.
It uses MSI interrupts, 64-bit bus-master S/G DMA, and is fully KMDF ...
I didn't think MSI interrupts work on a purely PCI device; I got the impression the 9054 in particular does not support them. The DMA packets are all tiny, DMA will never go directly to user space, and so one of the things I ended up doing was removing the scatter/gather logic.
I will take a look at the RIFFA driver to see if I can make that work...
I got as far as interrupt handling and it appears the framework takes care of whether the interrupt is MSI capable and uses the older wired interrupt if MSI packets are unsupported, so that doesn't appear to be an issue.
In a short time I can say it looks a lot like the Microsoft sample I started with that comes in the Windows DDK Samples and is found in:
I still have some work to do to make a good test of that logic; I will update later with any results I get using more purely KMDF.
Not so. MSI was first introduced in PCI 2.2. PCIExpress borrowed and extended the concept.
Tim Roberts, [email protected]
Providenza & Boekelheide, Inc.
The 9054 is a really old device (it dates from 1998), and it IS PCI 2.2 compliant. It includes a classic PCI bridge, and it does NOT support MSI (it uses INT A#).
Once again, OP... I'm going to suggest you put a logic analyzer on the interrupt line and take some measurements. This can’t take more than 20 minutes, and might reveal the source of your issue.
DPCs help you process more data with fewer interrupts. That's a good thing, because too many interrupts reduce performance. This difference becomes more significant the more work the OS has to do to maintain proper context on each one, and this is an area where different OSes can show different behaviour on the same hardware.
Using a DPC or not should have no bearing on the latency of getting your ISR called, but doing too much work in your ISR can. What exactly are you doing in that ISR?
Also note that the CPU does not do DMA per se. Nearly every instruction it executes will directly access system memory, but DMA implies that your device accesses system memory without involving the CPU.
What is the IRQ number displayed in Device Manager? Again, can you plug the card into another slot?
The only thing that comes to mind is some difference in interrupt routing behavior between real mode and Windows
(the BIOS knows that Windows is running, thanks to ACPI).
By the way - you don't mention the Windows version, is it Win7 (server 2008)?
I thought I posted this yesterday.
I'm not doing a lot in the (DMA read complete) ISR. I measure it at 9.5us. I copy off some telemetry and start a DMA write back to the board. I should be consuming <10% of one physical core in the ISR by those estimates, but I have no idea what other overhead that creates.
Right now I am re-implementing without WDM based on earlier suggestions. I'm assuming WDM has a high overhead.
My understanding was that DPCs get you out of the ISR sooner and operate at a lower priority so that other interrupts can perform more time critical tasks, and also to allow interface to objects that are not legal to touch during ISR.
The IRQ in device manager is 16. If I change slots the interrupt changes. In my DOS extender experiments I found PCI interrupts A-D are mapped to APIC interrupts 16-19 (as I believe is typical). I just verified this: moved the board two slots over and it changed from 16 to 18 (in Windows), as expected. In real mode it would have been mapped to IRQ 10, but the DOS extender runs in protected mode and I was able to reinvent Windows enough to use the APIC instead of the legacy 8259 PIC.
I am running this on Windows 10. It is the LTSC OEM version, which is enterprise without some of the nonsense.
You can't. I'm not sure what you really meant by that, but WDM IS the low-level Windows API. You can't leave home without it. Did you perhaps mean KMDF?
9.5us is a very long time for an ISR.
Tim Roberts, [email protected]
Providenza & Boekelheide, Inc.
9.5us is indeed ... OP, you're likely already doing this but the ISR should literally only be doing four things: a) reading a register on the device to determine "what happened", b) dispatching a DPC containing only the contents of that register, c) writing to another register on the device to clear the interrupt and d) reporting to the OS that the interrupt has been handled. That's it, four things, six to eight lines of code.
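A minimal sketch of those four steps follows, modeled here as portable C with a stubbed device so it stands alone. In a real KMDF driver the register accesses would be READ_REGISTER_ULONG / WRITE_REGISTER_ULONG against a BAR mapping and the DPC request would be WdfInterruptQueueDpcForIsr; everything here, including the register names, is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device model standing in for BAR-mapped registers. */
struct fake_dev {
    uint32_t int_status;   /* "what happened" register                 */
    uint32_t int_clear;    /* write-to-clear interrupt register        */
    uint32_t saved_status; /* snapshot handed to the DPC               */
    int      dpcs_queued;  /* stands in for WdfInterruptQueueDpcForIsr */
};

#define MY_IRQ_PENDING 0x1u

/* The four steps, as a user-mode model:
 *   a) read the status register to see what happened,
 *   b) capture it for the DPC and queue the DPC,
 *   c) clear the interrupt on the device,
 *   d) report handled / not ours to the OS. */
static bool fake_isr(struct fake_dev *dev)
{
    uint32_t status = dev->int_status;  /* (a) what happened?            */
    if (!(status & MY_IRQ_PENDING))
        return false;                   /* (d) not our interrupt (shared) */
    dev->saved_status = status;         /* (b) stash raw status for DPC   */
    dev->dpcs_queued++;                 /* (b) queue the DPC              */
    dev->int_clear = status;            /* (c) clear the interrupt        */
    dev->int_status &= ~status;         /* model: device deasserts line   */
    return true;                        /* (d) handled                    */
}
```

Note the not-ours path returns immediately without touching the device further, which matters on a shared, level-triggered line like the 9054's.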
Everything else, everything, should be handled in the DPC ... out of curiosity I hooked up my trusty Artix board to a logic analyzer to get some timings on the pertinent lines, doing just that on Win10 21H1 on an i7 took about 800ns for that ISR ... granted there is a world of difference between PCIe IP running on an FPGA and a PLX9054, but it illustrates that the ISR really should do its thing in under a microsecond on modern hardware ...
And KMDF really is your friend, even though the RIFFA driver isn't an exact match it's still a good match for a starting point ...
First, KMDF is built on top of WDM. It is highly unlikely that you want to use WDM directly, because that API (DDI) is VERY complex to use correctly. KMDF provides all of the boilerplate code that you need to do all of those things properly. It would be wrong to suppose that either model has higher overhead than the other.
But the bigger issue is 10us in an ISR. Typical ISR durations are measured in ns. If your measurements are correct, this would exactly explain your performance issues.
So, the OP’s ISR takes way too long… assuming his measurements and math are correct. But:
is a bit too dogmatic, even for me. And not frequently “best practice” either. Consider the lowly traditional keyboard driver. We wouldn’t really want to queue a DPC for every key press, right?
Somewhere between what we at OSR call The Famous Four Steps (things you do in an ISR) and writing all your processing code in your ISR, reasonable engineers can agree on a balance.
But back to the OP’s actual issue. We still don’t know how frequently the interrupt is being generated, and he apparently refuses to measure this for us, after I have asked him to do so multiple times. So….
[Dogmatic mode engaged] ...
Actually, I would contend that the lowly keyboard is a perfect example of why you would queue a DPC for every single keypress. Consider the design pattern: each time a chunk of data arrives from a piece of hardware (keystroke scan code, network packet, etc.), that chunk of data as presented from the hardware is independent and unique from any other chunk of data. It doesn't matter the size of the chunk; the fact that the hardware chose to present it in the manner that it did must be respected, and that includes the response to that presentation. There is no "balance" involved, the data and its presentation are what they are.
Suppose that I had an FPGA that did a DMA transfer of one and only one byte, generating an interrupt for each one ... would it be a good idea to say "I'm going to wait here in my ISR clearing interrupts until I've got a nice 64bytes of data, then push a DPC"?
Go back to the keyboard ... it's sending me a keystroke one at a time, would it make sense to say "nope, I'm going to gather 64 keystrokes here in my ISR before I push it to a DPC"?
When I've got a serial port sending me one character at a time, should I say "well, I'm just going to wait here in my ISR until I've got a good 64 bytes of data before I push to DPC"?
In each of these cases we are going to be making things far, far worse by trying to "agree to a balance" of processing some things in the ISR and try to "save a DPC" (really need to put that on a T shirt).
Back in the wild west days of XP, folks did try to "save a DPC" by doing everything in ISR space ... and that's why MS introduced timeout BSODs for ISRs and DPCs, because certain network drivers wanted that extra bit of speed you get by not bothering with returning from the ISR. I remember well porting a Linux network device driver to Windows years ago that had really good performance numbers ... because it did everything, from first interrupt to machine powerdown, inside of a never-to-be-returned-from ISR routine ...
The ISR four-step dance pattern works well for serial traffic, DMA transfers and keystrokes because it treats each chunk of data, as the hardware presents it, as unique from the other chunks ... whether a multi-megabyte S/G DMA buffer or a single serial byte or scan code. You can do whatever slicing and dicing you want after you've collected that chunk, in a DPC or an APC or a system thread, but IMHO you have to respect the fact that the hardware, for whatever reason, has chosen to present the discrete chunk of data in the manner that it did, and you need to respond to that presentation in the same atomic fashion, even if it means dispatching a DPC for every single unique chunk it presents ...
[Dogmatic mode disengaged] ... now back to my TB3 woes ...
Very fast typists manage about 10 keystrokes a second. You can handle that with an abacus and a sharp pencil.
Tim Roberts, [email protected]
Providenza & Boekelheide, Inc.
With all due respect, you might consider being more pragmatic. The overhead of requesting the DPC is far higher than whatever is required to put the scan code in a circular buffer and NOT queue the DPC.
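The pragmatic alternative being described, buffering in the ISR and requesting a DPC only when the buffer goes from empty to non-empty, can be sketched as a plain-C ring buffer. Names are hypothetical; in a real driver this would live in the device extension, the push would run inside the ISR, and the "queue" step would be KeInsertQueueDpc (which is already a no-op if the DPC is pending):

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 64u   /* power of two, so masking works as modulo */

struct scan_ring {
    uint8_t  buf[RING_SIZE];
    uint32_t head;        /* next slot the ISR writes           */
    uint32_t tail;        /* next slot the DPC consumes         */
    int      dpcs_queued; /* stands in for KeInsertQueueDpc     */
};

/* ISR side: stash the scan code; only request a DPC when the ring
 * transitions from empty to non-empty, so a burst of keys costs one DPC. */
static bool ring_push(struct scan_ring *r, uint8_t code)
{
    bool was_empty = (r->head == r->tail);
    uint32_t next = (r->head + 1) & (RING_SIZE - 1);
    if (next == r->tail)
        return false;     /* full: drop (real code might count overruns) */
    r->buf[r->head] = code;
    r->head = next;
    if (was_empty)
        r->dpcs_queued++; /* model of requesting the DPC */
    return true;
}

/* DPC side: drain everything that accumulated since the DPC was queued. */
static int ring_drain(struct scan_ring *r, uint8_t *out, int max)
{
    int n = 0;
    while (r->tail != r->head && n < max) {
        out[n++] = r->buf[r->tail];
        r->tail = (r->tail + 1) & (RING_SIZE - 1);
    }
    return n;
}
```

The design choice here is exactly the trade-off under discussion: the per-interrupt ISR work stays tiny (one store plus an index update), while the DPC rate drops to one per burst rather than one per byte.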
This is engineering, not serving happy meals. We need to use our engineering discretion to make appropriate trade-offs, with overall efficiency, system throughput, device and user needs, and suitability to the environment all considered and kept in balance. What we do not want to do is slavishly adhere to a rule that someone devised as a way of illustrating basic architectural precepts.
In Windows the architectural precept is that we do things that are truly time critical in our ISR, and then queue a DPCforISR for everything else. Except when that doesn’t meet our needs or make sense. It’s architecture, not an immutable law of physics.
The environment also matters. What’s appropriate on a medical instrument or an aircraft is not appropriate for general purpose PCs. What’s reasonable on a laptop might not make sense on a server (and vice versa).
What matters is understanding the architecture and making appropriate engineering decisions. Not slavish adherence to dogma.
What I've been doing the last couple of days is to re-implement as KMDF. I thought that KMDF was a lower level implementation and not an abstraction on top of WDM and that implementing as WDM was my mistake. If I understand correctly, KMDF is not going to make my driver any faster or more efficient, but it might keep me from making stupid mistakes.
Since I have not completed the re-implementation in the format of the RIFFA driver, I'm going to go back to using the WDM implementation that is not bugchecking on me and is servicing interrupts.
@Peter_Viscarola_(OSR) I'm not refusing to time the interrupts. I misunderstood the earlier comment on WDM and thought I needed to deal with that first. I don't have an oscilloscope or logic analyzer available and will need to get one as well as working out how to control the GPIO pins on the CPU board; that will take me a little time.
I have verified the hardware is working (at full speed) by testing it with the DOS driver. Sending a packet to the 9054 board initiates the cycle / causes the board to respond, so what an oscilloscope will tell me is whether the interrupt on DMA receipt is being delayed by the framework, or whether the next packet sent to the board is being delayed. It sounded like you were asking how often the board is sending interrupts, and I can guarantee it will be exactly the same rate as entries to the ISR, as the ISR initiates the cycle by commanding the DMA to the 9054 board.
I had no idea 9.5us was an evil thing. The (Microsoft) guidelines I had read only claimed it needed to be less than 25us. This is important information, and it is something I can almost certainly improve. The code is generously littered with TraceEvents statements that are presumably all active, given I'm running a debug build, and contributing to this time.
What I will do now is remove any debugging I am not directly using and time all the statements in the ISR. After I get that timing I will work out how to set GPIO lines, borrow the oscilloscope from work (my garage has been my lab since Covid), and collect the timing information that Peter has been requesting.
Thank you all for your suggestions and patience.
Testing the hardware under a different OS (DOS in this case) does not positively indicate that the rate of interrupt generation will be the same under another OS.
I’d skip the whole GPIO pin thing, though it would definitely be interesting… if that’ll take you much more time/effort… just clearing the asserted interrupt should tell us enough at a first level of approximation.
What I want to know, OP, from these timings are (1) are you the only one on this interrupt, (2) how frequent are the interrupts from your device, (3) and how long from the time that you clear the interrupt until it’s reasserted.
What I’d really like to see is the relationship between your device sending the interrupt (the processor on your local bus requesting the interrupt) to the time of the interrupt. If you can pick that signal off your device, that’d be nice. In general, bringing out any timing points that could be interesting would be useful. Multiple points and their relationship… that’s why a logic analyzer would be most useful.
It’s not like a logic analyzer trace is going to mystically reveal the underlying issue, either, right? You’re going to have to sit over it and puzzle it out.
I didn’t realize that you wouldn’t have a logic analyzer at hand. It sorta would have been nice if you said that, instead of repeatedly ignoring my suggestion. Of course, what we really want is a bus analyzer trace… but I didn’t figure you’d have one and so I didn’t ask for that.
Until you can hook on the GPIOs, maybe try to log events to memory, using RDTSC(P) for timestamps.
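A minimal sketch of that in-memory logging, assuming an x86 compiler with the `__rdtsc` intrinsic (MSVC gets it from `<intrin.h>`, GCC/Clang from `<x86intrin.h>`; the non-x86 fallback counter below is purely hypothetical so the sketch stays portable). In a real driver the array would be preallocated in the device extension:

```c
#include <stdint.h>

#if defined(_MSC_VER)
#  include <intrin.h>
#  define read_tsc() __rdtsc()
#elif defined(__x86_64__) || defined(__i386__)
#  include <x86intrin.h>
#  define read_tsc() __rdtsc()
#else
/* Hypothetical non-x86 fallback: a plain counter instead of the TSC. */
static uint64_t fake_tsc;
#  define read_tsc() (++fake_tsc)
#endif

#define LOG_SLOTS 5000u

/* Preallocated in-memory event log: no I/O and no allocation at log
 * time, so it is cheap enough to call from an ISR. */
static uint64_t g_log[LOG_SLOTS];
static uint32_t g_log_idx;

static void log_event(void)
{
    if (g_log_idx < LOG_SLOTS)         /* stop when full; dump later */
        g_log[g_log_idx++] = read_tsc(); /* raw tick count */
}
```

Convert ticks to time afterwards using the TSC frequency; on recent CPUs the TSC is invariant, so a single calibration against KeQueryPerformanceCounter suffices.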
Watch the CPU usage % in the task manager/resource monitor. Does it look reasonable?
High usage can hint that something else is going on there.