Hello,
My device driver maps the RAM of an external device connected to my machine via PCI
Express.
I want to use WdfMemoryCopyFromBuffer and WdfMemoryCopyToBuffer to copy data
from/to this external RAM.
Can you say whether those routines use 128-bit copies for optimal performance?
Is it possible to see the assembly code of a WDF device driver developed in
C (without a debugger)?
Thanks,
Zvika
That sounds remarkably like Remote DMA, but I was doing that on a Fibre Channel card 10 years ago, and it may not be the same.
Sure. Set the /FA compiler switch; there are variants of it, so look at the compiler documentation. That will give you all the assembly you could possibly hope for.
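As a rough sketch of what that looks like in practice (the file name copy_demo.c and the function CopyWords are made up for illustration, not taken from any driver): with the Microsoft compiler, /FA writes an assembly-only listing, /FAs interleaves the source lines, and /FAc adds the machine code. How the switch gets passed in a driver build depends on your build environment, so treat the command line in the comment as an assumption for a standalone file.

```c
/* copy_demo.c (hypothetical file name, for illustration only).
 * Compile with the Microsoft compiler, for example:
 *     cl /c /O2 /FAs copy_demo.c
 * and read the generated copy_demo.asm to see what the optimizer
 * emitted for the loop below. */
#include <stddef.h>
#include <stdint.h>

/* Copies 'count' 32-bit words from src to dst, one word per iteration.
 * The pointers are volatile, as is typical for device memory, so the
 * compiler must perform each access as written. */
void CopyWords(volatile uint32_t *dst, const volatile uint32_t *src, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}
```

The listing only covers code you compile yourself; calls into the framework show up as calls, not as the framework's own instructions.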
Gary Little
H (952) 223-1349
C (952) 454-4629
xxxxx@comcast.net
On Mar 9, 2012, at 2:39 PM, Zvi Vered wrote:
NTDEV is sponsored by OSR
For our schedule of WDF, WDM, debugging and other seminars visit: http://www.osr.com/seminars
To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer
Zvi Vered wrote:
> My device driver maps the RAM of an external device connected to my machine via PCI
> Express.
> I want to use WdfMemoryCopyFromBuffer and WdfMemoryCopyToBuffer to copy data
> from/to this external RAM.
> Can you say whether those routines use 128-bit copies for optimal performance?
CPU-based copies never use 128-bit transfers. Until you get to the
PCI Express root complex, the buses are all parallel, and there aren’t
enough pins for that.
In a 32-bit system, they use “rep movsd”, which is 32 bits at a time.
In a 64-bit system, they use “rep movsq”, which is 64 bits at a time.
It is POSSIBLE for your root complex to combine consecutive writes into
a larger packet, but in my experience, they do not do so. You have to
use bus mastering to do that.
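To make the width distinction concrete, here is a rough C sketch of a copy that moves 64 bits per iteration, which is roughly the granularity of “rep movsq”. The name CopyQwords is hypothetical and this is not the WDF implementation; it assumes both pointers are 8-byte aligned and finishes any leftover bytes one at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies 'bytes' bytes, 64 bits per iteration,
 * then finishes any remaining tail one byte at a time.
 * Assumes dst and src are both 8-byte aligned. */
void CopyQwords(void *dst, const void *src, size_t bytes)
{
    volatile uint64_t *d = (volatile uint64_t *)dst;
    const volatile uint64_t *s = (const volatile uint64_t *)src;
    size_t qwords = bytes / sizeof(uint64_t);
    size_t tail = bytes % sizeof(uint64_t);
    size_t i;

    assert(((uintptr_t)dst % sizeof(uint64_t)) == 0);
    assert(((uintptr_t)src % sizeof(uint64_t)) == 0);

    for (i = 0; i < qwords; i++) {
        d[i] = s[i];                 /* one 64-bit load and one 64-bit store */
    }

    /* Tail: fewer than 8 bytes left, copied bytewise. */
    {
        volatile uint8_t *db = (volatile uint8_t *)(d + qwords);
        const volatile uint8_t *sb = (const volatile uint8_t *)(s + qwords);
        for (i = 0; i < tail; i++) {
            db[i] = sb[i];
        }
    }
}
```

Swapping uint64_t for uint32_t gives the 32-bit equivalent. Either way, each store is a separate CPU write, and whether anything downstream merges them is up to the hardware.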
> Is it possible to see the assembly code of a WDF device driver developed in
> C (without debugger)?
Yes, but it doesn’t help, because you don’t have the source for WDF itself.
--
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
> I want to use WdfMemoryCopyFromBuffer and WdfMemoryCopyToBuffer to copy data from/to this external RAM.
> Can you say whether those routines use 128-bit copies for optimal performance?
> Is it possible to see the assembly code of a WDF device driver developed in C (without debugger)?
Just for background on current optimizations of memory copy:
http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load/
http://www.intel.com/technology/itj/2008/v12i3/3-paper/5-streaming.htm
The article suggests a copy performance figure of about 8.53 GBytes/sec for a 1067 MHz FSB. It also says you need to run in parallel across multiple cores to get optimal performance. Each MOVNTDQA transfers 16 bytes, and unrolling the loop across 4 registers moves a full 64-byte cache line per iteration, using the processor's special streaming buffers.
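For illustration, a minimal user-mode sketch of that unrolled pattern using the SSE4.1 intrinsic _mm_stream_load_si128, which compiles to MOVNTDQA. The name StreamCopy64 is hypothetical, and the code assumes 16-byte-aligned buffers and a length that is a multiple of 64. As I understand the article, the streaming-load benefit applies to write-combining (USWC) memory; on ordinary cached memory the instruction should behave like a normal load. On non-x86 builds the sketch falls back to memcpy so that it still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */
#define HAVE_X86_SSE41 1
#endif

#if defined(HAVE_X86_SSE41) && defined(__GNUC__)
__attribute__((target("sse4.1")))   /* GCC/Clang: enable SSE4.1 for this function only */
#endif
void StreamCopy64(void *dst, const void *src, size_t bytes)
{
    /* Assumptions of this sketch: aligned buffers, whole cache lines. */
    assert(((uintptr_t)dst % 16) == 0);
    assert(((uintptr_t)src % 16) == 0);
    assert((bytes % 64) == 0);
#if defined(HAVE_X86_SSE41)
    {
        __m128i *d = (__m128i *)dst;
        __m128i *s = (__m128i *)src;   /* some headers declare a non-const parameter */
        size_t i;
        for (i = 0; i < bytes / 16; i += 4) {
            /* Four 16-byte streaming loads cover one 64-byte cache line. */
            __m128i r0 = _mm_stream_load_si128(s + i + 0);
            __m128i r1 = _mm_stream_load_si128(s + i + 1);
            __m128i r2 = _mm_stream_load_si128(s + i + 2);
            __m128i r3 = _mm_stream_load_si128(s + i + 3);
            /* Ordinary aligned stores to the destination. */
            _mm_store_si128(d + i + 0, r0);
            _mm_store_si128(d + i + 1, r1);
            _mm_store_si128(d + i + 2, r2);
            _mm_store_si128(d + i + 3, r3);
        }
    }
#else
    memcpy(dst, src, bytes);           /* fallback for non-x86 targets */
#endif
}
```

Before using SIMD registers in a driver, check the WDK rules on saving and restoring floating-point/extended processor state for your target architecture.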
I guess the question to Microsoft would be: what is the best way to get optimal memory copy performance using KMDF, and how does that performance compare with processor-optimized methods?
Jan
Well, first you need to remember a rule that is older than Intel: if you
optimize for the latest processor, you decrease performance for the
existing customer base. This has been a challenge for compiler and
library writers for close to 50 years. So it is doubtful that Microsoft
will optimize kernel libraries based on Intel’s latest, since it would
probably degrade older systems, let alone AMD.
But your last sentence is the goal. Hopefully, Microsoft can provide
guidelines on the best way to get performance for this and probably a
number of other processor-constrained scenarios.
Don Burn
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr