Slow PCI read on Win7 vs XP

All excellent points, Peter. I’m a curious fella, so I’d take the time to evaluate those scenarios a little further. Agreed that doing all the processing under DIRQL and not seeing a change in performance would yield maximum levels of suck. However, it might drive me to start questioning hardware configuration a little more seriously. Although I’m not convinced that’s really the issue here considering this sounds like a widely used hardware platform. If anything I could see it being an issue with legacy interrupt emulation as Alex alluded to.

boris.shikhalev@l-3com.com wrote:

>> Note that the error you’re getting can also be caused by using a function that’s
>> not available on XP.
I thought that WDK 7.1.0 with KMDF 1.9 does not have anything that is not supported on XP, so if I build with _NT_TARGET_VERSION=$(_NT_TARGET_VERSION_WINXP) then I should be fine.

Yes, IF you build using the XP Build Environment. I don’t know what
tool you are using to do your builds, but it takes more than just
_NT_TARGET_VERSION. You also have to have the correct LIB directories
in DDK_LIB_PATH and SDK_LIB_PATH.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Not necessarily.

I just finished a significant project using the Windows 8.1 WDK, and I built the driver to target Windows 7. No other changes in the Build Environment were necessary (to build KERNEL MODE code… User Mode code is a different story).

The driver ran perfectly well on XP.

Sure. But if you call a WDM-style function that’s not supported on XP (and there are quite a few) the driver won’t load. For example, at one point I unintentionally introduced this problem in my “built for Win7 but runs great on XP” drivers by calling RtlIsNtDdiVersionAvailable or something similarly benign looking.
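One way to sidestep that trap is to resolve the suspect routine at run time instead of importing it, so the driver still loads on XP and can simply fall back when the routine is absent. A minimal sketch (not necessarily how Peter's driver handled it; MmGetSystemRoutineAddress itself is available on XP):

```c
#include <ntddk.h>

typedef BOOLEAN (NTAPI *PFN_RTL_IS_NTDDI_VERSION_AVAILABLE)(ULONG Version);

static BOOLEAN IsWin7OrLater(VOID)
{
    UNICODE_STRING name;
    PFN_RTL_IS_NTDDI_VERSION_AVAILABLE pfn;

    RtlInitUnicodeString(&name, L"RtlIsNtDdiVersionAvailable");

    /* Returns NULL when the routine is absent (e.g., on XP), instead of
       the loader refusing to load the driver over a missing import. */
    pfn = (PFN_RTL_IS_NTDDI_VERSION_AVAILABLE)
              MmGetSystemRoutineAddress(&name);

    if (pfn == NULL) {
        return FALSE;   /* routine absent: must be pre-Win7 */
    }
    return pfn(0x06010000 /* NTDDI_WIN7 */);
}
```

The same pattern works for any kernel export that only exists on newer OS versions.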

Peter
OSR
@OSRDrivers

>would even be sufficient in this case to do all the work in the ISR

Yes.

The pre-DMA ATA controllers of the mid-1990s and earlier used the same pattern of data transfer, and their drivers did the transfer in the ISR.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

>Win7 but runs great on XP" drivers by calling RtlIsNtDdiVersionAvailable

This function has been the main recommended way of determining the Windows version in drivers for about 10 years.

You can even build for w2k using 6001.18002 and call it; it will work.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

Well, I have changed a couple of things and now I can load the new driver on both XP and Win7. The timing on XP is quite different compared to Win7.

Win XP. Interrupts every ~30 milliseconds, DPC executes for ~15 milliseconds
Win7. Interrupts every ~30 milliseconds, DPC executes for ~20 milliseconds

XP TraceView output
00022689 src 3284 284 7 22712 11\23\2015-17:35:18:937 –> PLxInterruptHandler
00022690 src 3284 284 7 22713 11\23\2015-17:35:18:937 <– PLxInterruptHandler
00022691 src 3284 284 7 22714 11\23\2015-17:35:18:937 –> Pmc3362EvtInterruptDpc
00022692 src 3284 284 7 22715 11\23\2015-17:35:18:953 <– Pmc3362EvtInterruptDpc
00022693 src 0 0 7 22716 11\23\2015-17:35:18:968 –> PLxInterruptHandler
00022694 src 0 0 7 22717 11\23\2015-17:35:18:968 <– PLxInterruptHandler
00022695 src 0 0 7 22718 11\23\2015-17:35:18:968 –> Pmc3362EvtInterruptDpc
00022696 src 0 0 7 22719 11\23\2015-17:35:18:984 <– Pmc3362EvtInterruptDpc

Win7 TraceView output
00002450 src 6036 6100 3 2450 11\23\2015-17:52:25:496 –> PLxInterruptHandler
00002451 src 6036 6100 3 2451 11\23\2015-17:52:25:496 <– PLxInterruptHandler
00002452 src 6036 6100 3 2452 11\23\2015-17:52:25:496 –> Pmc3362EvtInterruptDpc
00002453 src 6036 6100 3 2453 11\23\2015-17:52:25:516 <– Pmc3362EvtInterruptDpc
00002454 src 0 0 3 2454 11\23\2015-17:52:25:525 –> PLxInterruptHandler
00002455 src 0 0 3 2455 11\23\2015-17:52:25:525 <– PLxInterruptHandler
00002456 src 0 0 3 2456 11\23\2015-17:52:25:525 –> Pmc3362EvtInterruptDpc
00002457 src 0 0 3 2457 11\23\2015-17:52:25:545 <– Pmc3362EvtInterruptDpc

I doubt that moving the DPC code to the ISR will give me a performance improvement, but I will try it anyway.

Thanks.

It seems to me, you guys are looking for a black cat in a dark room.
And as is customary in such cases, the cat isn’t there.

I mean, on nearly all modern desktop motherboards PCI PIO reads are slow. 500 KTransfers/sec is about the maximum you can realistically hope for. The reason is that the bus operation traverses too many PCIe-to-PCIe and PCIe-to-PCI bridges.

I should say that the situation is not new. Fast PCI reads are a thing of the past.
Intel’s last desktop chipset with the PCI bus closely coupled to the processor was the 440BX, released in April 1998.
Since then, the latency of PCI reads has gone steadily up.

Starting with the i810 chipset, the PCI bus was moved from the north bridge to the south bridge (a.k.a. ICH = I/O Controller Hub). So the CPU talked to the North Bridge (= MCH) over the Processor Bus (a.k.a. FSB), which in turn talked to the ICH via the so-called Hub Interface (HI), which talked to PCI devices via PCI. And every step added latency. Luckily, HI was a relatively simple synchronous link running at an exact multiple of the PCI clock, and it was designed primarily to carry PCI traffic.
So the latency increase relative to the 440BX was moderate.

The same design persisted through several generations of Intel chipsets up to the i865 and i875 (with ICH5, 2003). Since the processor bus got faster during that time, it’s reasonable to guess that PCI read latency on the i865 is actually lower than on the original i810. But it’s equally reasonable to guess that the difference is small, because the FSB leg is one of the smallest components of the total latency.

But around that time (2003-2004) AGP died and was replaced by PCIe, so Intel redesigned everything around PCIe.
Enter the 915 Express. There were still a North Bridge (= MCH) and a South Bridge (ICH), but instead of HI they were now connected through the Direct Media Interface (DMI). DMI is an Intel-proprietary variant of PCIe x4 Gen1. The theoretical peak throughput of DMI is 3.75 times higher than the old HI’s (or 7.5 times higher if you count full duplex), but for a host PCI read, peak throughput does not matter at all. What matters is latency. And the latency went up for several reasons. First, within the MCH, DMI is not connected to the FSB as directly as HI was in older chipsets; instead it goes through the PCIe root hub. Second, it crosses a clock domain (FSB<->PCIe) within the MCH. Third, it crosses a clock domain again within the ICH (PCIe<->PCI). Fourth, unlike with HI, legacy PCI traffic was not a main concern of the designers of DMI. Etc…

The 945 Express chipset, which is most likely at the heart of your “old”, “fast” system, is very similar to the 915, except that the processor bus is faster still. But we already said that the processor bus is a small part of the total latency.

So, now you know why your old “fast” system is so painfully slow. The remaining question is why your new system is slower yet.

The answer is that on the new chipset the PCI bus is even further away from the CPU.

As you probably know, starting with Sandy Bridge (a.k.a. 2nd-generation Core i7/i5/i3/Pentium/Celeron), Intel integrated the PCIe root hub into the same die as the CPU. That is a good thing for latency (no FSB to cross), but we already established that it matters little; the majority of the latency is elsewhere. And here comes the catch.
In newer chipsets, such as the most popular Lynx Point family (http://ark.intel.com/products/codename/37530/Lynx-Point), the South bridge (which, BTW, is still connected to the now-integrated North bridge via DMI) no longer has a PCI bus. The [rare] desktop motherboards that still have PCI now get it via an additional PCIe-to-PCI bridge, which is most typically not connected directly to the (integrated) North bridge but instead hangs off one of the x1 PCIe links of the South bridge. These PCIe-to-PCI bridges are third-party parts and tend to be old designs; one popular example is the IDT 89HPEB383. So, relative to Core2 times, PCI traffic now has one more PCIe bridge to cross. And please don’t forget that in the case of a host PCI read, your transaction crosses all these bridges at least twice!

What can be possibly done?
The right solution is, of course, a redesign of the HW<->SW interface. If you have an FPGA on the other side of the PCI9030, it should not be hard. Ideally, you should replace your FIFO with dual-ported memory. Then you can read it with a small number of long READ_REGISTER_BUFFER_ULONG() transfers (or with a simple memcpy(); it’s not officially blessed by the DDK docs, but it will work every single time). Each PCIe transaction running through the bridges will carry 16 bytes instead of 4, and your throughput will go up by a factor of 4. Or more, if you are lucky.
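With a DPRAM-style design, the bulk read can be a one-call burst. A sketch, with illustrative names and sizes, assuming the DPRAM BAR was mapped with MmMapIoSpace during hardware prepare:

```c
#include <wdm.h>

#define DPRAM_ULONG_COUNT 1024  /* illustrative DPRAM size in DWORDs */

/* Copy the whole dual-port RAM in one run of back-to-back bus reads.
   Each transaction crossing the bridge chain can then carry a
   multi-DWORD payload instead of a single DWORD per round trip. */
VOID ReadDpram(
    _In_  PULONG DpramBase,   /* device memory mapped via MmMapIoSpace */
    _Out_ PULONG HostBuffer)  /* nonpaged system-memory destination */
{
    READ_REGISTER_BUFFER_ULONG(DpramBase, HostBuffer, DPRAM_ULONG_COUNT);
}
```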
As a palliative, instead of mapping the whole dual-port RAM into PCI address space, you can address it through a relatively small sliding window, say 64 bytes or maybe even 16. Sliding the window will cost you a PCI write, but in the PCIe-based world write transactions are so much faster than read transactions that they can be considered free. With this scheme you will get more than half of the performance benefit of the right solution mentioned above without changing your whole memory map.
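The sliding-window variant might look roughly like this; WindowSelectReg and the 64-byte window are hypothetical device registers invented for illustration:

```c
#include <wdm.h>

#define WINDOW_BYTES   64
#define WINDOW_ULONGS  (WINDOW_BYTES / sizeof(ULONG))

/* Read TotalUlongs DWORDs through a small sliding window: one posted
   (cheap) write to move the window, then one burst read of its contents. */
VOID ReadThroughWindow(
    _In_  PULONG WindowSelectReg,  /* hypothetical: byte offset of window */
    _In_  PULONG WindowBase,       /* mapped 64-byte window */
    _Out_ PULONG Dest,
    _In_  ULONG  TotalUlongs)
{
    ULONG chunk;

    for (chunk = 0; chunk * WINDOW_ULONGS < TotalUlongs; chunk++) {
        WRITE_REGISTER_ULONG(WindowSelectReg, chunk * WINDOW_BYTES);
        READ_REGISTER_BUFFER_ULONG(WindowBase,
                                   Dest + chunk * WINDOW_ULONGS,
                                   WINDOW_ULONGS);
    }
}
```

The write that slides the window is posted, so it costs far less than the read round trips it replaces.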

What can be done if the board can’t be reprogrammed? Not much.
You can try to find a relatively new (say, 3-year-old) motherboard with a relatively new CPU whose PCI read performance is similar to, or maybe slightly better than, your old Core2’s.
Such boards do exist; they are typically based on the Intel B75 Express Chipset. There are also very similar P75 and P77 Express Chipsets, but those are likely not available any more.
Here is one possible B75 board: http://www.supermicro.com.tw/products/motherboard/Core/B75/C7B75.cfm

WOW, my first post here on osronline was long.

Boris Shikhalev
xxxxxx@l-3com.com

>Win XP. Interrupts every ~30 milliseconds, DPC executes for ~15 milliseconds
>Win7. Interrupts every ~30 milliseconds, DPC executes for ~20 milliseconds

That’s not a real difference. What you see is just an artifact of the different timer interrupt rates on XP vs Win7. Most likely 16 msec on the latter vs 10 msec on the former.

If you can’t use a more precise timing method, then at least increase the timer interrupt rate.
On XP, call timeBeginPeriod(1) from your application to set the period to 1 msec. Don’t exit the application, because when you exit, Windows will automatically restore the default period.
Alternatively, you can start Windows Media Player. That has the same effect.
I don’t know whether one or both of these methods work reliably on Win7, but you can try.
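A minimal user-mode sketch of the timeBeginPeriod approach (link with winmm.lib; the process must stay alive for the 1 msec period to remain in effect):

```c
#include <windows.h>
#include <mmsystem.h>   /* timeBeginPeriod / timeEndPeriod */
#include <stdio.h>

int main(void)
{
    /* Request a 1 msec system timer interrupt period. */
    if (timeBeginPeriod(1) != TIMERR_NOERROR) {
        fprintf(stderr, "timeBeginPeriod(1) failed\n");
        return 1;
    }

    printf("Timer period set to 1 msec; press Enter to restore it.\n");
    getchar();

    /* Windows also restores the default period on process exit. */
    timeEndPeriod(1);
    return 0;
}
```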

>That’s not a real difference.

Maybe so. I certainly agree that the timing traces we’re seeing are lower resolution than I’d like.

Welcome to the list, Mr. Shatz. And thank you for such a good premiere effort. I *really* enjoyed reading your notes on computer architecture. I suspect you’ll win our coveted Best Post of The Week award with that one.

The right way to do the timing is to call KeQueryPerformanceCounter at the start of the DPC and KeQueryPerformanceCounter at the end of the DPC. Either subtract the two and trace out the answer (AFTER taking the second time stamp) or trace out both and do the math in Excel.

Because you’re running non-dispatchable (in your ISR or your DPC) you don’t have to worry about the differences in the performance counter value on different CPUs (which was a problem in XP, IIRC).
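A sketch of that measurement in a KMDF DPC. KdPrint stands in for whatever tracing the driver actually uses; the point is that the output call comes only after the second timestamp:

```c
#include <ntddk.h>
#include <wdf.h>

/* Time the body of the DPC with the performance counter, and emit the
   result only AFTER the second timestamp so tracing cost isn't counted. */
VOID EvtInterruptDpc(
    _In_ WDFINTERRUPT Interrupt,
    _In_ WDFOBJECT    AssociatedObject)
{
    LARGE_INTEGER start, end, freq;
    LONGLONG elapsedUs;

    UNREFERENCED_PARAMETER(Interrupt);
    UNREFERENCED_PARAMETER(AssociatedObject);

    start = KeQueryPerformanceCounter(NULL);

    /* ... the real DPC work: drain the device FIFO, complete I/O ... */

    end = KeQueryPerformanceCounter(&freq);

    /* microseconds = ticks * 1e6 / ticks-per-second */
    elapsedUs = ((end.QuadPart - start.QuadPart) * 1000000LL) / freq.QuadPart;

    KdPrint(("DPC took %lld us\n", elapsedUs));
}
```

Equivalently, trace both raw counter values and do the subtraction later in Excel, as Peter suggests.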

Peter
OSR
@OSRDrivers

>>The right solution is, of course, is redesign of HW<->SW interface. If your have

>an FPGA on the other side of PCI9030 it should not be hard. Ideally, you should
>replace your FIFO by dual-ported memory.
Thank you for the great review, Mr. Shatz. I have an FPGA on the board, and the FIFO-to-DPRAM conversion is doable. I just did not want to do any FPGA work before all the software possibilities are exhausted.

>The right way to do the timing is to call KeQueryPerformanceCounter at the start
>of the DPC and KeQueryPerformanceCounter at the end of the DPC
I have not tried the performance counters, but I used KeQuerySystemTime at some point and got results consistent with the ETW timestamps, just reported at 100-nanosecond resolution.

The problem is that KeQuerySystemTime suffers from the exact same resolution issues that ETW does… The results are REPORTED in 100ns intervals, but they’re only accurate within 10ms to 15ms. That’s why KeQuerySystemTimePrecise was added in Windows 8 (which uses the performance counter).

But, in any case, it really IS KeQueryPerformanceCounter that you want to use (because it works back to Windows 2000).

Peter
OSR
@OSRDrivers

>>The problem is that KeQuerySystemTime suffers from the exact same resolution

>issues that ETW does… The results are REPORTED in 100ns intervals, but they’re
>only accurate within 10ms to 15ms
Got it. I will try KeQueryPerformanceCounter.