Right now I am re-implementing without WDM
You can’t. I’m not sure what you really meant by that, but WDM IS the low-level Windows API. You can’t leave home without it. Did you perhaps mean KMDF?
9.5us is a very long time for an ISR.
9.5us is indeed … OP, you’re likely already doing this but the ISR should literally only be doing four things: a) reading a register on the device to determine “what happened”, b) dispatching a DPC containing only the contents of that register, c) writing to another register on the device to clear the interrupt and d) reporting to the OS that the interrupt has been handled. That’s it, four things, six to eight lines of code.
Everything else, everything should be handled in the DPC … out of curiosity I hooked up my trusty Artix board to a logic analyzer to get some timings on the pertinent lines; doing just that on Win10 21H1 on an i7 took about 800ns for that ISR … granted there is a world of difference between PCIe IP running on an FPGA and a PLX9054, but it illustrates that the ISR really should do its thing in under a microsecond on modern hardware …
And KMDF really is your friend, even though the RIFFA driver isn’t an exact match it’s still a good match for a starting point …
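For what it’s worth, a minimal KMDF-flavored sketch of those four steps might look like the following. The register offsets, status bit, GetDeviceContext accessor, and context fields are all invented for illustration, not the PLX 9054’s real layout.

// Hypothetical register offsets and status bit -- substitute the device's real values.
#define DEV_REG_INT_STATUS 0x68
#define DEV_REG_INT_CLEAR  0x6C
#define DEV_INT_PENDING    0x00000001

BOOLEAN EvtInterruptIsr(WDFINTERRUPT Interrupt, ULONG MessageID)
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
    UNREFERENCED_PARAMETER(MessageID);

    // (a) Read a register on the device to determine "what happened".
    ULONG status = READ_REGISTER_ULONG((PULONG)(ctx->RegBase + DEV_REG_INT_STATUS));
    if ((status & DEV_INT_PENDING) == 0) {
        return FALSE;            // Not our interrupt -- let the OS keep walking the chain.
    }

    // (b) Hand the raw status to the DPC; nothing else gets examined here.
    ctx->LastIntStatus = status;

    // (c) Write to another register to clear the interrupt on the device.
    WRITE_REGISTER_ULONG((PULONG)(ctx->RegBase + DEV_REG_INT_CLEAR), status);

    // (d) Queue the DPC and report to the OS that the interrupt has been handled.
    WdfInterruptQueueDpcForIsr(Interrupt);
    return TRUE;
}

Everything downstream then happens in the EvtInterruptDpc callback at DISPATCH_LEVEL.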
First, KMDF is built on top of WDM. It is highly unlikely that you want to use WDM directly, because that API (DDI) is VERY complex to use correctly. KMDF provides all of the boilerplate code that you need to do all of those things properly. It would be wrong to suppose that either model has higher overhead than the other.
But the bigger issue is 10us in an ISR. Typical ISR durations are measured in ns. If your measurements are correct, this alone would explain your performance issues.
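To illustrate how little of that boilerplate is left to the driver, hooking the interrupt in KMDF is essentially the snippet below, typically done in EvtDriverDeviceAdd. The Device handle, the EvtInterruptIsr/EvtInterruptDpc callbacks, and the devContext->Interrupt field are assumed names here.

WDF_INTERRUPT_CONFIG interruptConfig;
NTSTATUS status;

// Point the framework at the ISR and DpcForIsr callbacks; KMDF handles
// connecting and disconnecting the interrupt across power transitions itself.
WDF_INTERRUPT_CONFIG_INIT(&interruptConfig, EvtInterruptIsr, EvtInterruptDpc);
status = WdfInterruptCreate(Device,
                            &interruptConfig,
                            WDF_NO_OBJECT_ATTRIBUTES,
                            &devContext->Interrupt);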
So, the OP’s ISR takes way too long… assuming his measurements and math are correct. But:
Everything else, everything should be handled in the DPC .
is a bit too dogmatic, even for me. And not frequently “best practice” either. Consider the lowly traditional keyboard driver. We wouldn’t really want to queue a DPC for every key press, right?
Somewhere between what we at OSR call The Famous Four Steps (things you do in an ISR) and writing all your processing code in your ISR, reasonable engineers can agree a balance.
But back to the OP’s actual issue. We still don’t know how frequently the interrupt is being generated, and he refuses — apparently — to measure this for us, after I have asked him to do so multiple times. So….
Peter
@“Peter_Viscarola_(OSR)” said:
So, the OP’s ISR takes way too long… assuming his measurements and math are correct. But:
Everything else, everything should be handled in the DPC.
is a bit too dogmatic, even for me. And not frequently “best practice” either. Consider the lowly traditional keyboard driver. We wouldn’t really want to queue a DPC for every key press, right?
Somewhere between what we at OSR call The Famous Four Steps (things you do in an ISR) and writing all your processing code in your ISR, reasonable engineers can agree a balance.
[Dogmatic mode engaged] …
Actually, I would contend that the lowly keyboard is a perfect example of why you would queue a DPC for every single keypress. Consider the design pattern that each time a chunk of data arrives from a piece of hardware (keystroke scan code, network packet, etc.), that chunk of data as presented by the hardware is independent and distinct from any other chunk of data. Doesn’t matter the size of the chunk; the fact that the hardware chose to present it in the manner that it did must be respected, and that includes the response to that presentation. There is no “balance” involved, the data and its presentation are what they are.
Suppose that I had an FPGA that did a DMA transfer of one and only one byte, generating an interrupt for each one … would it be a good idea to say “I’m going to wait here in my ISR clearing interrupts until I’ve got a nice 64 bytes of data, then push a DPC”?
Go back to the keyboard … it’s sending me a keystroke one at a time, would it make sense to say “nope, I’m going to gather 64 keystrokes here in my ISR before I push it to a DPC”?
When I’ve got a serial port sending me one character at a time, should I say “well, I’m just going to wait here in my ISR until I’ve got a good 64 bytes of data before I push to DPC”?
In each of these cases we are going to be making things far, far worse by trying to “agree to a balance” of processing some things in the ISR and try to “save a DPC” (really need to put that on a T shirt).
Back in the wild west days of XP, folks did try to “save a DPC” by doing everything in ISR space … and that’s why MS introduced timeout BSODs for ISRs and DPCs, because certain network drivers wanted that extra bit of speed you get by not bothering with returning from the ISR. I remember well porting a Linux network device driver to Windows years ago that had really good performance numbers … because they did everything, from first interrupt to machine powerdown, inside of a never-to-be-returned-from ISR routine …
The ISR four-step dance pattern works well for serial traffic, DMA transfers, and keystrokes because it treats each data chunk exactly as the hardware presents it, unique from the other chunks … whether that’s a multi-megabyte S/G DMA buffer or one single serial or scan-code byte. You can do whatever slicing and dicing you want after you’ve collected that chunk in a DPC or an APC or a system thread, but IMHO you have to respect the fact that the hardware, for whatever reason, has chosen to present the discrete chunk of data in the manner that it did, and you need to respond to that presentation in the same atomic fashion even if it means dispatching a DPC for every single unique chunk it presents …
[Dogmatic mode disengaged] … now back to my TB3 woes …
Consider the lowly traditional keyboard driver. We wouldn’t really want to queue a DPC for every key press, right?
Very fast typists manage about 10 keystrokes a second. You can handle that with an abacus and a sharp pencil.
With all due respect, you might consider being more pragmatic. The overhead of requesting the DPC is far higher than whatever is required to put the scan code in a circular buffer and NOT queue the DPC.
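A rough sketch of one way to get most of that effect, with invented context fields (ScanRing, RingHead) and a hypothetical data-port offset; it leans on the framework’s own DPC coalescing rather than a hand-rolled flag:

#define SCAN_RING_SIZE 64   // Invented size; a real driver also has to handle overflow.

BOOLEAN KbdEvtInterruptIsr(WDFINTERRUPT Interrupt, ULONG MessageID)
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
    UNREFERENCED_PARAMETER(MessageID);

    // Grab the scan code and drop it into the ring buffer -- that's the cheap part.
    UCHAR scanCode = READ_REGISTER_UCHAR((PUCHAR)(ctx->RegBase + KBD_DATA_OFFSET));
    ctx->ScanRing[ctx->RingHead++ % SCAN_RING_SIZE] = scanCode;

    // WdfInterruptQueueDpcForIsr returns FALSE if the DpcForIsr is already queued,
    // so a burst of keystrokes coalesces into one DPC that drains the whole ring.
    WdfInterruptQueueDpcForIsr(Interrupt);
    return TRUE;
}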
This is engineering, not serving happy meals. We need to use our engineering discretion to make appropriate trade-offs, with overall efficiency, system throughput, device and user needs, and suitability to the environment all considered and kept in balance. What we do not want to do is slavishly adhere to a rule that someone devised as a way of illustrating basic architectural precepts.
In Windows, the architectural precept is that we do things that are truly time-critical in our ISR, and then queue a DpcForIsr for everything else. Except when that doesn’t meet our needs or make sense. It’s architecture, not an immutable law of physics.
The environment also matters. What’s appropriate on a medical instrument or an aircraft is not appropriate for general purpose PCs. What’s reasonable on a laptop might not make sense on a server (and vice versa).
What matters is understanding the architecture and making appropriate engineering decisions. Not slavish adherence to dogma.
Peter
What I’ve been doing the last couple of days is to re-implement as KMDF. I thought that KMDF was a lower level implementation and not an abstraction on top of WDM and that implementing as WDM was my mistake. If I understand correctly, KMDF is not going to make my driver any faster or more efficient, but it might keep me from making stupid mistakes.
Since I have not completed re-implementation in the format of the RIFFA driver, I’m going to go back to using the WDM implementation, which is not bugchecking on me and is servicing interrupts.
@“Peter_Viscarola_(OSR)” I’m not refusing to time the interrupts. I misunderstood the earlier comment on WDM and thought I needed to deal with that first. I don’t have an oscilloscope or logic analyzer available and will need to get one as well as working out how to control the GPIO pins on the CPU board; that will take me a little time.
I have verified the hardware is working (at full speed) by testing it with the DOS driver. Sending a packet to the 9054 board initiates the cycle / causes the board to respond, so what an oscilloscope will tell me is whether the interrupt on DMA receipt is being delayed by the framework, or whether the next packet sent to the board is being delayed. It sounded like you were asking how often the board is sending interrupts, and I can guarantee it will be exactly the same rate as entries to the ISR, as the ISR initiates the cycle by commanding the DMA to the 9054 board.
I had no idea 9.5us was an evil thing. The (Microsoft) guidelines I had read only claimed it needed to be less than 25us. This is important information, and something I can almost certainly dig into further. The code is generously littered with TraceEvents statements that are presumably all active (I’m running a debug build) and contributing to this time.
What I will do now is remove any debugging I am not directly using and time all the statements in the ISR. After I get that timing I will work out how to set GPIO lines, borrow the oscilloscope from work (my garage has been my lab since Covid), and collect the timing information that Peter has been requesting.
Thank you all for your suggestions and patience.
Testing the hardware under a different OS (DOS, in this case) does not positively indicate that the rate of interrupt generation will be the same under another.
I’d skip the whole GPIO pin thing, though it would definitely be interesting… if that’ll take you much more time/effort… just clearing the asserted interrupt should tell us enough at a first level of approximation.
What I want to know, OP, from these timings are (1) are you the only one on this interrupt, (2) how frequent are the interrupts from your device, (3) and how long from the time that you clear the interrupt until it’s reasserted.
What I’d really like to see is the relationship between your device sending the interrupt (the processor on your local bus requesting the interrupt) to the time of the interrupt. If you can pick that signal off your device, that’d be nice. In general, bringing out any timing points that could be interesting would be useful. Multiple points and their relationship… that’s why a logic analyzer would be most useful.
It’s not like a logic analyzer trace is going to mystically reveal the underlying issue, either, right? You’re going to have to sit over it and puzzle it out.
I didn’t realize that you wouldn’t have a logic analyzer at hand. It sorta would have been nice if you said that, instead of repeatedly ignoring my suggestion. Of course, what we really want is a bus analyzer trace… but I didn’t figure you’d have one and so I didn’t ask for that.
Peter
Until you can hook on the GPIOs, maybe try to log events to memory, using RDTSC(P) for timestamps.
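A minimal sketch of that kind of in-memory event log; the array, its size, and the event IDs are invented here, and the raw TSC values still have to be correlated with the TSC frequency afterwards:

#include <intrin.h>                 // __rdtsc / __rdtscp intrinsics

#define TRACE_SLOTS 8192            // Invented size; power of two keeps the wrap cheap.

typedef struct _TRACE_ENTRY {
    ULONG64 Tsc;                    // Raw timestamp-counter value.
    ULONG   EventId;                // Caller-defined code (ISR entry, DPC entry, ...).
} TRACE_ENTRY;

static TRACE_ENTRY g_Trace[TRACE_SLOTS];
static volatile LONG g_TraceIndex;

// Callable at any IRQL; just stamps the event into the next slot of the ring.
VOID LogTscEvent(ULONG EventId)
{
    ULONG slot = (ULONG)InterlockedIncrement(&g_TraceIndex) & (TRACE_SLOTS - 1);
    g_Trace[slot].Tsc = __rdtsc();
    g_Trace[slot].EventId = EventId;
}

The array can then be dumped from the debugger (or printed at leisure) after the run, so nothing expensive happens on the hot path.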
Watch the CPU usage % in the task manager/resource monitor. Does it look reasonable?
High usage can hint that something else is going on there.
/* I cannot understand what the Intel folks were thinking with regard to helping external developers.
Core and even Atom class processors and their bridges are packed with features, and developers are struggling to reach the GPIOs.
Meanwhile, almost every tiny, dirt-cheap Cortex-M chip has a sort of simple logic analyzer (ITM), accessible over a cheap 3-pin interface
(well, through yet another dirt-cheap USB dongle).
*/
The comments on this board have been incredibly helpful to me.
Telling me that 9.5us was very wrong led me to carefully time each part of the code to find the problem. The sample I was using contained macros that translate to READ/WRITE_REGISTER_ULONG, and each of those instructions was taking ~2us. I don’t believe I need a memory barrier in my design, so I changed to normal memory access, and that removed nearly all of the delay.
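For the record, the “normal access” presumably amounts to a plain volatile dereference of the mapped register space, along the lines of the single line below (a sketch of the change being described, not a recommendation; whether bypassing the READ_REGISTER_* macros is safe depends on the platform and how the BAR was mapped):

ULONG intStatus = *(volatile ULONG *)(pdx->pRegVa + PCI9054_INT_CTRL_STAT); // direct read of the mapped PLX register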
Telling me I was misunderstanding or misinterpreting what I saw got me to start changing individual settings to verify I was actually affecting anything: if it is difficult to get something to work, it should at least be easy to break parts and confirm that I’m doing what I think I’m doing.
One of the things I did was to intentionally fail to reset the interrupt in the ISR while leaving the rest of the logic in place such that after I return from the interrupt the system will immediately interrupt again. I expected the system to crash, but instead I found my diagnostic messages that print every 5,000 interrupts were printing approximately every 0.045 seconds based on the timestamps in the debug output. I calculate that to mean my driver running on a low-end i3 can service over 100,000 interrupts per second or approximately 10 times the rate I require - while in debug mode and sending diagnostics.
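(Worked out from the numbers above: 5,000 interrupts every ~0.045 s is 5,000 / 0.045 ≈ 111,000 interrupts per second, or roughly 9us per serviced interrupt end to end, which is where the “over 100,000” figure comes from.)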
I have sequence counters that verify to me the interrupt really can be serviced over 100,000 times per second, and I wanted to share that; it was incredibly good news to me.
In addition, I found some of my settings are having no effect at all, and I now know where to concentrate.
@Pavel_A : I have been logging timestamps from KeQueryPerformanceCounter to an array in the Extension, examining them every 5,000 interrupts, and sending analysis to DbgView with DbgPrint. Resource Monitor was indicating all cores were idle previously. When I use my intentionally broken driver that continuously interrupts, I see 16-20% of the CPU being used by System Interrupts, and overall CPU usage of 6-7%. GPU usage goes from 0 to 30% when I engage the continuous-interrupt driver, presumably from all the updating of DbgView.
@“Peter_Viscarola_(OSR)” : 1) I believe I am the only device on this PCI interrupt, based on the following: I was able to set the interrupt as exclusive in the call to IoCreateDevice without any repercussions, and there are only two boards in the system (the CPU and the 9054 board). The interrupt assigned is 16-19 depending on which slot I place the 9054 board in, as expected, and the slot does not impact performance.
2) The interrupts should be every 100us, or 10,000 per second. They are currently arriving at 1.6ms intervals. I believe what I’m seeing is some sort of timeout rather than normal operation. The packet from the board should be in response to my sending a DMA packet to the board, and that is the part of my driver that is making no difference; I don’t think I’m actually sending a packet.
3) ~1.6ms
What I suspect I have done is relied on a shortcut for scheduling the DMA to the board (that works in the DOS driver), and I need to be more meticulous in my setup.
Thank you once again for all the assistance. It has made a world of difference.
The sample I was using contained macros that translate to READ/WRITE_REGISTER_ULONG, and each of those instructions was taking ~2us.
How are you measuring that? READ_REGISTER_ULONG becomes nothing more than a single CPU read instruction that becomes a bus cycle. Even on a Gen1 PCIe bus, that should be single-digit nanoseconds, unless your hardware is stalling.
@Tim_Roberts : slightly modified to use what VS2019 says the macros translate to and to make it a code snippet. The elapsed time was consistently 20 or 21 ticks for that line.

pdx = (DEVICE_EXTENSION *)ServiceContext;
pdx->TickArray[pdx->ulIntCount++] = KeQueryPerformanceCounter(NULL); // Record when we entered
READ_REGISTER_ULONG((U32 *)((pdx)->pRegVa + (PCI9054_INT_CTRL_STAT)));
if (pdx->ulIntCount >= 5000) // Every 5,000 interrupts
{
    LARGE_INTEGER l1;
    int l3;

    l1 = KeQueryPerformanceCounter(NULL);               // Read the counter again after the register read
    l3 = l1.LowPart - pdx->TickArray[4999].LowPart;      // Ticks elapsed since ISR entry
    DbgPrint("Elapsed time: %d ticks.", l3);             // Consistently prints 20 or 21
    pdx->ulIntCount = 0;
}
I have succeeded. It was a setup issue. Thank you all once again.
Yay! I’m glad you got stuff to work!
But…C’mon dude. A lot of folks here spent a lot of time trying to help you. Maybe you can tell us a bit more than “It was a setup issue”??
Peter
I did want to comment, however, on your timing code. When KeQueryPerformanceCounter returns 20 or 21, that does NOT mean 2 microseconds. It does not return 100ns units. You have to divide that number by the counter frequency, which is returned in the pointer you pass to KQPC, which you are passing as NULL. That number varies a fair amount based on operating system and sometimes on hardware.
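A minimal sketch of that conversion (variable names invented; the multiply/divide is best done after the fact, outside the ISR):

LARGE_INTEGER freq;
LARGE_INTEGER start = KeQueryPerformanceCounter(&freq);   // frequency comes back through the pointer

// ... code being timed ...

LARGE_INTEGER stop = KeQueryPerformanceCounter(NULL);

// Convert raw ticks to microseconds using the reported frequency.
LONGLONG elapsedUs = ((stop.QuadPart - start.QuadPart) * 1000000) / freq.QuadPart;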
@“Peter_Viscarola_(OSR)” I had missed part of the initialization sequence. A command packet instructs the board (there is no manual) to change the interrupt frequency. The driver was working as expected, but the 9054 board was not properly configured. In the DOS driver it was intentionally commanded to a maximum delay to allow the initialization to complete, and then later changed to the desired frequency. I was seeing that maximum delay, which was why it was so consistent and not at all random.
@Tim_Roberts I did measure the granularity and got 1.0E7.
DbgPrint("Perf Counter Ticks Per Second = %d:%d", tps.HighPart, tps.LowPart);
returned "= 0:10000000".
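(Putting the two measurements together: at 10,000,000 ticks per second each tick is 0.1us, so the 20 or 21 ticks measured around that register read work out to roughly 2.0 to 2.1us, consistent with the ~2us per access reported earlier.)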