How to copy memory more quickly?

Dear all:
I am developing a driver for a PCIe x1 device. I copy data from the driver’s buffer to
the memory on the PCIe device. The device memory is mapped into the Windows address space.
I use RtlCopyMemory to do the copy, but it is slow.

I measured the following transfer rates:
writing data to the device: 18 MB/sec
reading data from the device: 3 MB/sec

I use the following equipment.

Intel Core2 Duo E6750 2.66G
DDR2-800 1G
Intel P35

Is there any way to make the copy faster? Is it possible to reach 100 MB/sec? Thanks!

xxxxx@hotmail.com wrote:

> I will develop a driver for a PCI-Ex1 device. I will copy the data from driver’s buffer to
> the memory on the PCI-E device. The memory of the device is mapped to the windows.
> I use the function RtlCopyMemory to do the copy action. However, the speed is not fast.
>
> I measure that the transfer rate is about
> writing data to the device: 18 MB/sec
> reading data from the device:3 MB/sec

I would expect you to do a little better than that. Are you sure your
device is able to keep up at that rate?

> I use the following equipment.
>
> Intel Core2 Duo E6750 2.66G
> DDR2-800 1G
> Intel P35
>
> Is there any way to make the copy faster? Is it possible to reach 100 MB/sec? Thanks!

Yes, but only by having your device be a bus master. DMA is the only
way to get maximum throughput on PCIe.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

You could perform an aligned memory copy.

Cheers,
Matt

Hi Felix,
Tim is right: RtlCopyMemory just uses the CPU to loop through the transfer, which is very inefficient. I have not done a PCIe project yet, but I am sure it is similar to PCI. We do shared-memory DMA with our device. The actual bus cycle in the PCI spec is referred to as a “burst transfer”. It takes considerably fewer bus cycles and, in our case, requires no CPU intervention, since it is initiated by the card itself.

-Justin

Dear Tim:
My device has DMA. I use the DMA to copy the data, and the transfer rate is faster than copying with the CPU. However, the CPU is powerful, yet its transfer rate is very slow.
I want to find a way to copy the data more quickly. Is there another way to do the copy with the PC’s CPU? If I can find one, it will be useful for developing my application.
By the way, is there any DMA engine in the north or south bridge that I can use? Thanks for your help.

Felix

A long list of random ideas that might help…

RtlCopyMemory is inappropriate for accessing mapped device memory. You
should use one of the transfer macros like READ_REGISTER_BUFFER_ULONG64.

You might want to ensure your device BAR allows prefetch read access.
Uncached access may force full bus cycles for each 32/64 bits instead of
initiating a burst transfer. You might also use write combining attributes
on the memory mapping for your writes; this usually made writes to video
cards much faster. Make sure your transfers are aligned on a good value,
like a cache line.
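
To make the alignment point concrete, here is a minimal user-mode sketch, not driver code. The helper names is_aligned and copy_words_aligned are hypothetical; in a real driver the READ_REGISTER_BUFFER_* / WRITE_REGISTER_BUFFER_* routines would perform the device access. It only shows how to check a pointer against an alignment and copy in full-width 32-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is aligned to `alignment` bytes (alignment must be non-zero). */
static int is_aligned(const volatile void *p, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}

/* Copies `count` 32-bit words to a word-aligned destination, one full-width
 * store per word. `volatile` keeps the compiler from merging or dropping the
 * stores, which matters for device memory. Hypothetical helper, for
 * illustration only. */
static void copy_words_aligned(volatile uint32_t *dst, const uint32_t *src,
                               size_t count)
{
    assert(is_aligned(dst, sizeof(uint32_t)));
    assert(is_aligned(src, sizeof(uint32_t)));
    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}
```

The same is_aligned check can be used with 64 (a common cache-line size, assumed here) to verify that a transfer starts on a cache-line boundary.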

Some devices have a very long latency, but once a burst starts, can move a
good size block of data. A single PCI-e 1.1 lane has a theoretical bandwidth
of about 250 Mbytes/sec (per direction). To achieve the 100 Mbytes/sec you
asked about would require 40% of the theoretical bandwidth, which my gut
feeling (which may be totally wrong) says might be difficult for a target
mode device. The fact that you’re getting 18/3 Mbytes/sec (write/read) makes
me think your device probably has significant latency, so getting the burst
size up will be critical. Those numbers seem so low that I wonder if there is
perhaps error retry happening, although I don’t have any experience with x1
PCI-e devices.
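
The percentages above can be checked with a little arithmetic. This is a minimal sketch, assuming PCIe 1.x signalling at 2.5 GT/s per lane with 8b/10b encoding, which is where the figure of about 250 Mbytes/sec per direction comes from. The function names are made up for illustration, and packet overhead is ignored, so real throughput will be lower.

```c
#include <assert.h>

/* Raw data bandwidth of one PCIe 1.x lane in MB/s (1 MB = 10^6 bytes).
 * Assumes 2.5 GT/s signalling and 8b/10b encoding (8 data bits per 10 line
 * bits). TLP/DLLP packet overhead is ignored. */
static double pcie1_lane_mb_per_sec(void)
{
    const double line_bits_per_sec = 2.5e9;
    const double data_bits_per_sec = line_bits_per_sec * 8.0 / 10.0;
    return data_bits_per_sec / 8.0 / 1e6;
}

/* Percentage of the theoretical lane bandwidth used by a measured rate. */
static double percent_of_lane(double measured_mb_per_sec)
{
    assert(measured_mb_per_sec >= 0.0);
    return 100.0 * measured_mb_per_sec / pcie1_lane_mb_per_sec();
}
```

Under these assumptions, percent_of_lane(100.0) gives 40, while the measured 18 and 3 Mbytes/sec come to roughly 7.2% and 1.2% of the lane.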

There have been MANY threads here on ntdev over the years on how to do fast
PCI target mode access. The answers usually are: 1) it depends a LOT on your
motherboard chipset, 2) bus master transfers work a LOT better.

The best way to debug what’s really happening will probably require a PCI-e
bus analyzer. I remember the first time I used a PCI bus analyzer to
optimize a driver, it was VERY enlightening (or depressing might be more
correct).

SSE instructions are also capable of 128-bit wide transfers, which might
cause bursting when integer register reads/writes don’t.

I’d also recommend you read a book on PCI-e or talk to a hardware engineer
and figure out what burst size you will need on a x1 interface to achieve
the bandwidth you need.

Another possibility would be to get some other bus master to use your card as
a target, which can also cause large burst transfers. Potential bus masters
include disk controllers, video controllers, and special memory copy
hardware like the Intel IOAT on some motherboards.

Modern Intel (and I assume AMD) processors have a number of cache
prefetch/manipulation instructions that might alter the burst behavior.

Jan

> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com] On Behalf Of
> xxxxx@hotmail.com
> Sent: Friday, May 30, 2008 9:33 AM
> To: Windows System Software Devs Interest List
> Subject: [ntdev] How to copy memory more quickly?
>
> Dear ALL:
> I will develop a driver for a PCI-Ex1 device. I will
> copy the data from driver’s buffer to the memory on the PCI-E
> device. The memory of the device is mapped to the windows.
> I use the function RtlCopyMemory to do the copy action.
> However, the speed is not fast.
>
> I measure that the transfer rate is about writing data to the
> device: 18 MB/sec reading data from the device:3 MB/sec
>
> I use the following equipment.
>
> Intel Core2 Duo E6750 2.66G
> DDR2-800 1G
> Intel P35
>
> Is there any way to make the copy faster? Is it possible to
> reach 100 MB/sec? Thanks!
>
>
> NTDEV is sponsored by OSR
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online
> at http://www.osronline.com/page.cfm?name=ListServer

I assume you have a PCIe controller with some local SDRAM, and you want to transfer data from host memory to controller memory. The fastest method is to use your controller’s DMA.
DMA on the host side is used to transfer data within host memory to save CPU cycles. Usually that DMA wouldn’t be used for transferring data to controller memory. Even if you try that (after mapping your controller memory into the host address space), I don’t think it would be faster than the controller’s DMA.
How big is the data? How many bytes do you want to access in that memory? You can reserve some host memory and map it locally. Then you don’t need to copy; you can treat the host memory as local memory. Remember that the access still goes across the PCI bus, but it’s useful if you only want to access a few bytes.

Dear Jan:
Thanks for your detailed explanation. I have some questions:

  1. Do I use READ_REGISTER_BUFFER_ULONG64 instead of RtlCopyMemory? My device is 32-bit. Should I use READ_REGISTER_BUFFER_ULONG?
  2. What are the “write combining attributes on the memory mapping for your writes”? Would you give me some sample code?
  3. Are there any common bus masters I can use? Is there any API? Because my device must be compatible with most computers, I need to find a common bus master to do
    this. Thanks for your kind help.

Felix


> Is there any way to make the copy faster? Is it possible to reach 100 MB/sec?

No.

To reach such speed, you need a different hardware design, i.e. no memory on
the device, with a DMA busmaster on the device accessing the main RAM.

DMA is much faster than CPU-initiated transfers.

This will also reduce the manufacturing costs of the device.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> I want to try the find a way to copy the data more quickly. Is there another way to
> do the copy by PC’s CPU?

No, use DMA.

When the CPU accesses addresses on the PCI target, each 32/64-bit word is
transferred as a separate PCI cycle: an inter-cycle time gap, an address phase
and a data phase. This is slow.

DMA uses burst cycles, i.e. one inter-cycle time gap, one address phase and then
many data phases. In effect, the whole memcpy() is executed as a single long
PCI cycle. This is much faster: one 32/64-bit word per PCI clock period.
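
A rough clock-count model of the difference described above, as a sketch only. It assumes conventional PCI with one idle clock between cycles, one address phase and zero wait states. Those overheads are placeholders, not measurements, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-transaction overhead, in bus clocks. Real values depend on
 * the bridge and the target (wait states, turnaround cycles). */
enum { IDLE_GAP_CLOCKS = 1, ADDRESS_PHASE_CLOCKS = 1 };

/* Clocks needed when every word is its own bus cycle (CPU-initiated copy):
 * gap + address phase + one data phase, repeated for each word. */
static uint64_t clocks_single_word(uint64_t words)
{
    return words * (IDLE_GAP_CLOCKS + ADDRESS_PHASE_CLOCKS + 1);
}

/* Clocks needed when all words go out as one burst (bus-master DMA):
 * one gap, one address phase, then one data phase per word. */
static uint64_t clocks_burst(uint64_t words)
{
    if (words == 0) {
        return 0;
    }
    return IDLE_GAP_CLOCKS + ADDRESS_PHASE_CLOCKS + words;
}

/* Throughput in MB/s (1 MB = 10^6 bytes) for a transfer of `words` words
 * that took `clocks` bus clocks at `clock_hz`. */
static double mb_per_sec(uint64_t words, uint64_t clocks,
                         unsigned bytes_per_word, double clock_hz)
{
    assert(clocks > 0);
    return (double)(words * bytes_per_word) * clock_hz / (double)clocks / 1e6;
}
```

For 1024 32-bit words, this model gives 3072 clocks word by word versus 1026 clocks as a single burst, roughly a factor of three even before any target wait states are counted.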

Also note that on some PCI derivatives the data phases of a burst transfer
can be executed at a clock rate 2-4 times faster than the PCI clock.
At least AGP could do this, probably PCIe too.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> The DMA in host side is used to transfer data within host memory to save CPU cycle.

Sorry, what particular DMA do you mean? With PCI, there is no “DMA in host
side”, and if you mean the ancient 8237 logic in the PCI-ISA bridge, it
is very slow, much slower than memcpy.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> 1. DO I use READ_REGISTER_BUFFER_ULONG64 instead of RtlCopyMemory? My device
> is 32-bit.Should I use READ_REGISTER_BUFFER_ULONG?

Yes.

> 2. What is the “write combining attributes on the memory mapping for your
> writes”? Would you give me a sample code?

Write combining is a capability of the north bridge (between the CPU and PCI) to
coalesce write transactions to consecutive addresses into one long PCI-side
burst.

The north bridge can only use this feature if the CPU-to-north-bridge
writes are marked as “write combinable”. This, in turn, only happens when
the CPU writes to the memory through PTEs with the “write combined” caching
type.

To create such PTEs in Windows, call MmMapIoSpace with MmWriteCombined as the
caching type.

Things like video framebuffers are nearly always write combinable; the
control/command registers of a device, on the other hand, never are.
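
To illustrate the coalescing idea only, here is a simplified user-mode model, not driver code. It assumes a single 64-byte combining buffer (the size and the one-buffer policy are assumptions, as is the function name) and counts how many bus transactions a run of sequential writes would produce with and without combining.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WC_BUFFER_BYTES 64u  /* assumed combining-buffer size */

/* Counts bus transactions for `count` sequential writes of `word_bytes` each,
 * starting at address `base` (assumed to be a multiple of word_bytes).
 * Without combining, every write is its own transaction. With combining,
 * consecutive writes that fall into the same aligned WC_BUFFER_BYTES block
 * are merged and counted as one burst. */
static size_t sequential_write_transactions(uint64_t base, unsigned word_bytes,
                                            size_t count, int write_combined)
{
    size_t transactions = 0;
    uint64_t current_block = 0;
    int buffer_open = 0;

    assert(word_bytes > 0 && word_bytes <= WC_BUFFER_BYTES);

    for (size_t i = 0; i < count; i++) {
        uint64_t addr = base + (uint64_t)i * word_bytes;
        uint64_t block = addr / WC_BUFFER_BYTES;

        if (!write_combined) {
            transactions++;          /* one bus cycle per word */
            continue;
        }
        if (!buffer_open || block != current_block) {
            transactions++;          /* a new buffer, flushed later as one burst */
            current_block = block;
            buffer_open = 1;
        }
    }
    return transactions;
}
```

In this model, 1024 sequential 4-byte writes from an aligned base count as 1024 transactions uncombined and 64 combined. A 64-byte run that starts at offset 32 spans two blocks and costs two bursts, which is one reason alignment matters.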

> 3. Is there any common bus masters I can use?

You must have one within your device, there is no “common bus master”.

> Is there any API?

IoGetDmaAdapter and so on.

> Because my device must be compatible with most computer, I need the find a
> common bust master to do this.

There is no such thing as common bus masters.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Just a couple of questions:

  1. Are you speaking about DMA or about copying data from/to on-device memory by the CPU?
    The very fact that you mentioned memcpy() strongly suggests the latter. When Tim said “use DMA”, it was just another way of saying “instead of copying data from/to on-device memory by the CPU, transfer data directly between your device and main memory”, i.e. program the device’s registers to start DMA and go about your business until the DMA completes and the interrupt fires…

  2. Are you speaking about a PCI device? I really have no clue what you mean by “common bus master”, but I strongly suspect that you mean a DMA controller, i.e. the mechanism used by ISA devices, since they cannot become bus masters themselves. However, PCI devices must have bus-mastering capability so that they are able to issue cycles on the PCI bus…

Anton Bassov

Anton, *must* is too strong a word. It’s the HW vendor’s decision whether to provide
bus-mastering capability. It’s OK for a PCI device not to have a DMA controller, so the
option of bus-mastering DMA may not work for the OP.

> the option of bus-mastering DMA may not work for the OP.

Well, somehow I presume that the company the OP works for has full control over the hardware design of the devices that it produces. Certainly they can make a PCI device without bus-mastering capability and, instead, copy data to on-device memory by memcpy() the way they currently do, or even rely upon an on-board DMA controller (shit, the very phrase about a PCI device using an on-board DMA controller the way ISA devices do sounds ridiculous), but then they should not complain about the low rate of data transfers…

Anton Bassov

> controller (shit, the very phrase about PCI device using on-board DMA controller
> the way ISA devices do sounds ridiculous)

There always was such support in the PCI-ISA bridge (where the old PC/AT DMA
at port 0x00 resides) and the PCI config space.

I.e. when the BIOS sees such a device with proper config space, it configures
some facility in PCI-ISA bridge and also in the device (via config space) to
use this facility.

Then, when the device wants to run DMA, it sends PCI transactions as busmaster
to the PCI-ISA bridge, not to the main RAM. Then the PCI-ISA bridge uses the
address/length from port 0x00 to run the real transaction against the RAM.

This was called “distributed DMA” and was used by SoundBlaster 16 PCI
replacements.

But this facility only makes the PCI device compatible with an ISA one in terms
of software DMA programming; the PCI device still needed a busmaster. It
works around the lack of DREQ/DACK wires in PCI and the lack of the 0x00 port for
PCI, but not the lack of busmaster logic in the device.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com