Maximum Performance in PCI transfers

Hello.
I am working on a PCI implementation that does intensive number crunching; that is, it uses massive amounts of data to perform calculations.
Data are provided by the CPU.
The implementation runs as expected, does the work as intended, and I get pretty good performance figures.
I am using an ordinary PCI board, without DMA, without bursting and without PCI-X goodies.
All data transfer is done using ordinary I/O (through a memory-mapped section).
My current board does not support DMA.

Although current performance is good, I need to improve it.
My tests show that most of the performance penalty comes from the data transfers, so I suspect that improving the data transfer rate would improve overall performance.

In your experience, what is the best way to substantially increase the data transfer rate on PCI cards that use massive data?
Bursting, DMA access, switching to PCI-X, …?

In general, in my case, cost is not a major issue. I want a substantial increase in data transfer rate no matter the cost.
Complexity is important but not crucial. It would be preferable that the board not require a specific motherboard, but a PCI-X bus would be acceptable.

Thanks for your answers.
Inaki.

In my experience (my company designed a custom PCI card interface and I
wrote the driver), you can’t get PCI burst cycles (at least not
guaranteed) unless you do bus master DMA. If you don’t get the PCI bus to
burst your cycles, you are only getting about 1/8 of the performance. We
have a 64-bit 66 MHz card, and a normal PCI PIO cycle takes at least
120 ns per transfer (240 ns on a 33 MHz bus), while a burst (which removes
the overhead of re-negotiating the PCI cycle) takes 15 ns per transfer
(30 ns on a 33 MHz bus). So you can see that getting the cycles to burst on
the PCI bus is imperative, and you can only really do this with bus master
DMA.
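
The timing figures above translate into throughput with simple arithmetic. Here is a minimal sketch in C, assuming a 64-bit bus moves 8 bytes per transfer and a 32-bit bus moves 4, and counting 1 MByte as 10^6 bytes. The function name `mbytes_per_sec` is made up for this illustration, and the results are theoretical peaks, not measurements.

```c
#include <assert.h>

/* Peak throughput in MBytes/sec (1 MByte = 10^6 bytes) for a bus that moves
 * bytes_per_transfer bytes every ns_per_transfer nanoseconds.
 * Hypothetical helper for illustration only. */
double mbytes_per_sec(unsigned bytes_per_transfer, double ns_per_transfer)
{
    assert(ns_per_transfer > 0.0);
    /* bytes/ns * (10^9 ns/s) / (10^6 bytes/MByte) = bytes/ns * 1000 */
    return (double)bytes_per_transfer / ns_per_transfer * 1000.0;
}
```

With the numbers quoted above, `mbytes_per_sec(8, 120.0)` is about 66.7 and `mbytes_per_sec(8, 15.0)` is about 533.3 for the 64-bit 66 MHz card, while `mbytes_per_sec(4, 240.0)` is about 16.7 and `mbytes_per_sec(4, 30.0)` is about 133.3 for a 32-bit 33 MHz bus. In both cases the burst figure is 8 times the PIO figure, which is where the "about 1/8" comes from.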

Our card can achieve over 250 MBytes/sec sustained transfer rate using
scatter/gather bus master DMA. Scaling that to a 33 MHz 32-bit PCI bus
would still give over 60 MBytes/sec. Our application needs the full
250 MBytes/sec, but yours would probably do just fine with a 33 MHz/32-bit
PCI bus if you implement SG DMA. It may not even need scatter/gather; DMA
alone will give a significant boost.
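
For readers unfamiliar with scatter/gather: the driver hands the device a list of (bus address, length) pairs that describe a buffer whose pages are not physically contiguous. The sketch below shows the idea in plain C. The `sg_entry` layout, the 4 KiB page size and the `build_sg_list` helper are all assumptions made for illustration; real hardware defines its own descriptor format, and in a Windows driver the DMA adapter routines (for example GetScatterGatherList) normally build the list for you.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SG_PAGE_SIZE 4096u   /* assumed page size */

/* Hypothetical descriptor: one physically contiguous run of bytes. */
typedef struct {
    uint64_t bus_addr;   /* bus address of the run */
    uint32_t length;     /* length of the run in bytes */
} sg_entry;

/* Build a scatter/gather list for a buffer of total_len bytes that starts
 * first_offset bytes into the first page. page_addrs[i] is the bus address
 * of the i-th page backing the buffer. Adjacent pages are merged into one
 * entry. Returns the number of entries written, or 0 if the buffer is
 * empty, the pages do not cover it, or out[] is too small.
 * Assumes no merged run exceeds 4 GiB (length is 32 bits). */
size_t build_sg_list(const uint64_t *page_addrs, size_t n_pages,
                     uint32_t first_offset, size_t total_len,
                     sg_entry *out, size_t max_out)
{
    size_t count = 0;
    size_t remaining = total_len;

    assert(first_offset < SG_PAGE_SIZE);

    for (size_t i = 0; i < n_pages && remaining > 0; i++) {
        uint32_t off = (i == 0) ? first_offset : 0;
        size_t room = SG_PAGE_SIZE - off;
        uint32_t chunk = (uint32_t)(remaining < room ? remaining : room);
        uint64_t addr = page_addrs[i] + off;

        if (count > 0 &&
            out[count - 1].bus_addr + out[count - 1].length == addr) {
            /* Physically adjacent to the previous run: extend it. */
            out[count - 1].length += chunk;
        } else {
            if (count == max_out)
                return 0;            /* descriptor table too small */
            out[count].bus_addr = addr;
            out[count].length = chunk;
            count++;
        }
        remaining -= chunk;
    }
    return remaining == 0 ? count : 0;
}
```

Fewer, longer entries mean the device can issue longer bursts, which is why the sketch merges adjacent pages. Plain (non-SG) DMA corresponds to the case where the whole buffer is one contiguous run.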

Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256

To unsubscribe send a blank email to xxxxx@lists.osr.com


Russ Poffenberger
Credence Systems Corp.
xxxxx@credence.com

You never get real performance without DMA, even though
modern CPUs are fast. One reason is that the overhead
(initiator requests, target claims, termination) of a
single bus access is huge. So switch to DMA if you
can.

In DMA mode, you want to keep the burst transactions
efficient. There are many points to consider, but these
are the most common ones to look into:

1) Use Memory Read Multiple instead of Memory Read Line; you
want a long burst.

2) Keep your internal DMA FIFO as full as possible.

3) Hook up an analyzer to the bus and watch the burst
transactions to see whether any unnecessary wait
states are inserted. If you’re using Memory Read Multiple, make
sure the target doesn’t disconnect between two cache line fill
transactions.
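
To make point 1 concrete, here is a small C sketch of how a bus master might pick among the PCI read commands (Memory Read, Memory Read Line, Memory Read Multiple). The command encodings are the standard C/BE#[3:0] values. The selection rule is a simplified rule of thumb, not a quotation from the spec, and the names `pci_read_cmd` and `choose_read_cmd` are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* PCI bus command encodings driven on C/BE#[3:0] during the address phase. */
typedef enum {
    PCI_CMD_MEMORY_READ          = 0x6,
    PCI_CMD_MEMORY_READ_MULTIPLE = 0xC,
    PCI_CMD_MEMORY_READ_LINE     = 0xE
} pci_read_cmd;

/* Simplified rule of thumb (an assumption, check the PCI Local Bus spec):
 *  - non-prefetchable space or a single DWORD: Memory Read
 *  - stays within one cache line:              Memory Read Line
 *  - crosses a cache line boundary:            Memory Read Multiple */
pci_read_cmd choose_read_cmd(uint64_t addr, size_t len,
                             size_t cacheline_bytes, int prefetchable)
{
    assert(cacheline_bytes > 0);

    if (!prefetchable || len <= 4)
        return PCI_CMD_MEMORY_READ;

    /* Does [addr, addr + len) extend past the end of its first cache line? */
    if ((size_t)(addr % cacheline_bytes) + len > cacheline_bytes)
        return PCI_CMD_MEMORY_READ_MULTIPLE;

    return PCI_CMD_MEMORY_READ_LINE;
}
```

For a DMA engine pulling large blocks from host memory, almost every transfer crosses a cache line boundary, so under this rule it would use Memory Read Multiple nearly all the time, which matches the advice to aim for long bursts.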

HTH,

Calvin Guan Windows DDK MVP
Staff SW Engineer, NetXtreme MINIPORT
Enterprise Network Controller Engineering
Broadcom Corporation www.broadcom.com

> From: Inaki Castillo
> Sent: Tuesday, May 31, 2005 10:24 AM
> To: Windows System Software Devs Interest List
> Subject: [ntdev] Maximum Performance in PCI transfers
>
> In your experience which is the best way to increase
> data transfer rate substantially on PCI cards that
> use massive data ?
> Bursting, DMA access, switch to PCI-X,…?

1) Very easy: use CPU write combining mapping/caching for
   the memory mapped IO to your board.

   MmMapIoSpace(…, MmWriteCombined);

   It allows a PCI chipset to use PCI bursting for CPU writes.
   From my experience it increases useful write bandwidth from
   15-20 MB/s to 100-110 MB/s on a standard 32-bit/33 MHz PCI
   bus.

2) Use SG DMA.

3) Switch to PCI Express (PCIe)

Dmitriy Budko, VMware

Be very careful when using write combining. Keep in
mind that the order in which individual writes reach
the hardware may differ from the order in which they were
issued, and you have absolutely no control over the timing,
such as when a cache line will be written to the HW. Make
sure the HW is OK with out-of-order write access and
doesn’t care much about the timing.

In general, WC is good for a video frame buffer but not
for a typical device.


Calvin Guan Windows DDK MVP
Staff SW Engineer, NetXtreme MINIPORT
Enterprise Network Controller Engineering
Broadcom Corporation www.broadcom.com

At 02:27 PM 5/31/2005, you wrote:
> 1) Very easy: use CPU write combining mapping/caching for
> the memory mapped IO to your board.
>
> MmMapIoSpace(…, MmWriteCombined);
>
> It allows a PCI chipset to use PCI bursting for CPU writes.
> From my experience it increases useful write bandwidth from
> 15-20 MB/s to 100-110 MB/s on a standard 32-bit/33 MHz PCI
> bus.

I have found this to be unreliable, not only because the writes can be out
of order, but also because it is very chipset dependent. If you want
something that will always work, you need to use DMA.

Russ Poffenberger
Credence Systems Corp.
xxxxx@credence.com

He could optimize his consecutive PIO accesses with
“PCI Fast Back-to-Back” transfers. That requires that both
master and target meet the timing constraints defined
in the PCI Local Bus spec. Check my post on the MS newsgroup.

http://groups-beta.google.com/group/microsoft.public.development.device.drivers/browse_frm/thread/76bf982597f176dd/6ed01f9cb9a9f795?q=FBTB+calvin&rnum=1&hl=en#6ed01f9cb9a9f795


Calvin Guan Windows DDK MVP
Staff SW Engineer, NetXtreme MINIPORT
Enterprise Network Controller Engineering
Broadcom Corporation www.broadcom.com

At 12:03 PM 6/1/2005, you wrote:
> He could optimize his consecutive PIO accesses with
> “PCI Fast Back-to-Back” transfers. That requires that both
> master and target meet the timing constraints defined
> in the PCI Local Bus spec. Check my post on the MS newsgroup.

Doesn’t this still imply that the bridge implementation must support it? The
bottom line is that any time the CPU is the master, you are at the whim of
the capabilities of the particular CPU/bridge implementation. When you do
bus master DMA, YOU (the device, now acting as bus master) are in control.

Russ Poffenberger
Credence Systems Corp.
xxxxx@credence.com

DMA and bursting.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com
