Hello all,
I have a PLX9030-based PCI data acquisition card and a 32-bit XP kernel driver that reads the data FIFO through BAR4 and places the data in the application’s ring buffer. The original driver was written using the Numega DriverWorks framework. I rewrote the driver using WDF to work on 32-bit and 64-bit Win 7. The new driver works fine, except that it reads data much more slowly than the XP one. The read is a simple PIO operation and is exactly the same in both the XP and Win7 drivers.
The problem does not depend on the flavor: both the 32-bit and 64-bit builds are slow (it takes 30 ms to copy 64KB of data from the FIFO register to the ring buffer, and this consumes ~70% of a CPU core). I cannot benchmark the old driver since I cannot rebuild it, but it consumes at most half as much CPU.
I am wondering why I see such a difference?
Thanks.
Which macro are you using to read the BAR?
boris.shikhalev@l-3com.com wrote:
> [...] I am wondering why I see such a difference?
How are you doing the copy, and how are you doing the timing? Is the
data exposed in exactly one register? Have you mapped the BAR as
uncached? Are the ioctl codes the same (so, both direct or both buffered)?
–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
The data is exposed in only one register. Copying is done using READ_REGISTER_ULONG. The BAR is mapped as uncached. IOCTLs are not issued during data transfer. They are used only to start data acquisition when the application starts and to stop it when the application exits.
The 9030 is a pretty old part. IIRC, it’s pretty easy to set this device up “wrong” and get it to perform badly.
Are you sure you’re setting it up EXACTLY the same way as you were in the DriverWorks driver? Your initialization code for the part should be pretty-much identical.
There’s nothing special, mystical, or unique that DriverWorks could be doing to make the code faster. The answer must be something simple.
30ms for 64K is, indeed, pretty darn slow. Yuck.
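To put the reported figure in perspective, here is a small worked calculation. It is only a sketch, assuming that "64KB" means 65,536 bytes and that every READ_REGISTER_ULONG call moves 32 bits; the function names are made up for illustration and are not part of any driver discussed here.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 32-bit register reads needed to move `bytes` bytes. */
static uint32_t ulong_reads(uint32_t bytes)
{
    return bytes / 4u;
}

/* Average time per 32-bit read, in nanoseconds. */
static double ns_per_read(uint32_t bytes, double elapsed_ms)
{
    assert(bytes >= 4u && elapsed_ms > 0.0);
    return (elapsed_ms * 1e6) / (double)ulong_reads(bytes);
}

/* Sustained throughput in MiB per second. */
static double mib_per_second(uint32_t bytes, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    return ((double)bytes / (1024.0 * 1024.0)) / (elapsed_ms / 1000.0);
}
```

With 65,536 bytes in 30 ms this works out to 16,384 reads at roughly 1.8 microseconds each, or about 2 MiB/s.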
It would be interesting to have answers to Mr. Roberts’ questions:
- How are you doing the copy (where is the data read from the device stored)?
- How are you doing the timing?
- Is the data read from a single register?
To this I’ll add: Are you SURE that the device is working under EXACTLY the same conditions with both drivers?
Peter
OSR
@OSRDrivers
Use READ_REGISTER_BUFFER_ULONG, if the buffer occupies a range in the BAR. READ_REGISTER_ULONG will have a memory barrier after each read.
Is the register memory-mapped IO? If not, you should use the READ_PORT_BUFFER_ULONG instead.
-----Original Message-----
From: xxxxx@lists.osr.com [mailto:bounce-595852-xxxxx@lists.osr.com] On Behalf Of boris.shikhalev@l-3com.com
Sent: Tuesday, November 17, 2015 11:29 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Slow PCI read on Win7 vs XP
>>Is the data read from a single register?
Yes
>Use READ_REGISTER_BUFFER_ULONG, if the buffer occupies a range in the BAR.
I cannot use READ_REGISTER_BUFFER_ULONG because I essentially read from one register all the time.
>How are you doing the copy (where is the data read from the device stored)?
I have created a test version of the driver that reads the FIFO into a dummy variable allocated within the driver, to rule out any influence from copying to the application’s buffer. Still slow.
>How are you doing the timing.
Do you mean ISR-DPC?
On Tue, Nov 17, 2015 at 2:34 PM, wrote:
> Use READ_REGISTER_BUFFER_ULONG, if the buffer occupies a range in the BAR.
> READ_REGISTER_ULONG will have a memory barrier after each read.
I think he said it is a fifo so it doesn’t occupy a range but instead a
single address.
Mark Roddy
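The distinction being drawn here is between reading a range of addresses and reading one fixed address repeatedly. The following is a user-mode sketch of the two access patterns, not driver code: the function names are hypothetical, and plain memory stands in for the device. On real hardware each read of the FIFO address would pop a new word; the stand-in simply returns the same value each time, which makes the difference in addressing visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Range copy: the source address advances with every element.
 * This is the access pattern the thread attributes to a buffer-style read. */
static void copy_from_range(uint32_t *dst, const volatile uint32_t *src,
                            size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}

/* FIFO drain: the source address never changes, only the destination
 * advances. The volatile qualifier forces one real read per iteration. */
static void drain_fifo(uint32_t *dst, const volatile uint32_t *fifo_reg,
                       size_t count)
{
    assert(dst != NULL && fifo_reg != NULL);
    for (size_t i = 0; i < count; i++) {
        dst[i] = *fifo_reg;
    }
}
```

A single-register FIFO needs the second pattern, which is why a one-read-at-a-time loop is used in this case.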
boris.shikhalev@l-3com.com wrote:
>> How are you doing the timing.
> Do you mean ISR-DPC?
No, I mean how are you measuring 30ms for 64kB? Where are you
measuring, and what are you using?
–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
HOW slow?
Mr. Shikhalev, with all due respect, if you want us to be able to provide you with the most possible help, you *really* do need to answer the questions we’re asking you.
If you don’t understand some of these questions, please say so and we’ll be happy to explain further. That’s preferable to your simply ignoring the questions, and leaving us to wonder (or, worse, become frustrated).
Peter
OSR
@OSRDrivers
I use ETW trace timestamps on entry to and exit from the DPC.
Peter,
I am sorry for giving somewhat misleading information or leaving it out altogether.
I have quite limited knowledge of kernel driver development.
By “slow” I meant that I did not see any performance change compared to the version that copies data to the application’s ring buffer.
I can make only a very rough estimate of execution speed, since I use ETW trace timestamps for it.
That’s fine… we’re here to help.
I’ll ask my previous questions again, given that you must have missed them:
The 9030 is a pretty old part. IIRC, it’s pretty easy to set this device up
“wrong” and get it to perform badly.
Are you sure you’re setting it up EXACTLY the same way as you were in the
DriverWorks driver? Your initialization code for the part should be pretty-much
identical.
To this I’ll add: Are you SURE that the device is working under EXACTLY the same
conditions with both drivers? EXACT same system, EXACTLY the same workload?
New questions:
So, you’re reading the registers in your DPC… and the 30ms is the time it takes for your entire DPC to run? Because the time in your DPC can vary a lot (considering other system load).
If you do the test 1,000 times, what are your highest and lowest times (reading into the buffer AND reading into a dummy variable)?
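One way to collect the highest and lowest times over many runs is a small accumulator like the one below. This is a minimal sketch with hypothetical names. It assumes the elapsed time of each run is already available in microseconds; in a driver that value might come from timestamps taken around the read loop (for example with KeQueryPerformanceCounter), which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Running minimum, maximum and total of per-run elapsed times. */
struct timing_stats {
    uint64_t min_us;
    uint64_t max_us;
    uint64_t total_us;
    size_t   samples;
};

static void timing_stats_init(struct timing_stats *s)
{
    s->min_us = UINT64_MAX;  /* any real sample will be smaller */
    s->max_us = 0;
    s->total_us = 0;
    s->samples = 0;
}

/* Record one run's elapsed time. */
static void timing_stats_add(struct timing_stats *s, uint64_t elapsed_us)
{
    if (elapsed_us < s->min_us) s->min_us = elapsed_us;
    if (elapsed_us > s->max_us) s->max_us = elapsed_us;
    s->total_us += elapsed_us;
    s->samples++;
}

/* Integer mean of all recorded runs. */
static uint64_t timing_stats_mean(const struct timing_stats *s)
{
    assert(s->samples > 0);
    return s->total_us / s->samples;
}
```

Keeping one accumulator for the ring-buffer case and another for the dummy-variable case would allow the two sets of numbers to be compared directly.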
Peter
OSR
@OSRDrivers
The 9030 initialization is the same for the XP and Win7 drivers.
The systems are different: XP is on a Core2 Duo 2.2 GHz machine and Win7 is on a Core i7 3 GHz machine.
The data payload is the same. I use the same data source.
Yes, 30 ms is the time it takes for the entire DPC to run. It is pretty consistent.
The range is 28 ms to 32 ms for both the dummy variable and the buffer.
The same PCI board is used for XP and Win7 tests.
Thank you, Mr. Shikhalev – The additional data is much appreciated.
Hmmmm… so, on a machine that should be 50% faster, things are *much* slower and the CPU utilization is *much* higher.
This really looks like something is happening on the Win7 system that’s NOT happening on the XP system. OR, the device is having compatibility problems on the new machine. Like I said, the 9030 is a pretty old part… and was designed for the older PCI bus. Still, it SHOULD work.
Suggestion #1:
Any chance you can run the two systems on exactly the same hardware? I’m very suspicious of the environment.
Suggestion #2:
Another (architecturally invalid but useful) test you could run, just to eliminate any differences between what DriverWorks might be doing and doing it “the right way” in Windows, would be to dereference the register directly to read its contents instead of using READ_REGISTER_ULONG. So, instead of:
someVar = READ_REGISTER_ULONG(MyRegister);
You do:
someVar = *MyRegister;
And get the timings. THAT would be interesting.
Peter
OSR
@OSRDrivers
>>Any chance you can run the two systems on exactly the same hardware?
Can be done if I get two identical PCs. I will try.
>someVar = READ_REGISTER_ULONG(MyRegister); vs someVar = *MyRegister;
I already tried that yesterday, and the timing was identical.
Thanks.
>>> Any chance you can run the two systems on exactly the same hardware?
> Can be done if I get two identical PCs. I will try.
My own golden rule is: "If you compare software performance on different OS versions, do it on the same machine (same CPU, same RAM type and same amount)."
In your case, you need to be able to multiboot into Win 7 and XP on the same machine.
Christiaan