PCI read/write throughput

Hi all,
I'll have a PCI card with an FPGA reading and writing to RAM on the
card. The memory will be mapped into the driver's memory space. What's the
average rate I can expect when reading/writing from/to the card?
What is needed to achieve the best throughput? Setting up a host-to-card DMA
transfer? Who should be the bus master, the card or the host?

Thanks,
Aaron



On the PCI bus the DMA engine lives on the card. To get the fastest throughput, have your hardware implement a bus-master DMA engine with hardware scatter/gather support.

Paul Lever
----- Original Message -----
From: Aaron Elberg
To: NT Developers Interest List
Sent: Sunday, April 15, 2001 11:26 PM
Subject: [ntdev] PCI read/write throughput




> I'll have a PCI card with an FPGA reading and writing to RAM on the
> card. The memory will be mapped into the driver's memory space. What's the
> average rate I can expect when reading/writing from/to the card?
> What is needed to achieve the best throughput? Setting up a host-to-card DMA
> transfer? Who should be the bus master, the card or the host?

Your achieved transfer rates can vary widely based on a bunch of factors:
from maybe 1 MByte/sec (or slower, if your device has lots of burst
latency and your bursts are small) to 120 MBytes/sec (if your bursts are
big and latency isn't too horrible) on a 32-bit PCI bus. The exact details
of the PCI implementation on the FPGA can have a big impact on performance
too, like the size of its burst buffers.

Other factors include the code you use to access your device, I assume in
PCI target mode. Unless your device has its own bus master support, only
PCI target mode will be available. Generally, this means all transfers go
through the CPU. It might be possible to get OTHER PCI devices to use
their bus master modes to read/write your device in target mode (like a bus
master SCSI controller).
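To make "target mode" concrete, here is a minimal sketch (assuming a WDM-style driver; the variable names and BAR handling are placeholders, not anything from a real design) of what CPU-driven target access looks like:

    /* Sketch only: target-mode (CPU-driven) access to card memory from a
     * WDM-style driver. CardPhysAddr/CardLength stand in for the translated
     * CmResourceTypeMemory resource the PnP manager hands the driver at
     * start-device time; the names are placeholders. */
    #include <wdm.h>

    static PUCHAR CardBase;   /* kernel virtual address of the card's RAM BAR */

    NTSTATUS MapCardMemory(PHYSICAL_ADDRESS CardPhysAddr, ULONG CardLength)
    {
        CardBase = (PUCHAR)MmMapIoSpace(CardPhysAddr, CardLength, MmNonCached);
        return (CardBase != NULL) ? STATUS_SUCCESS : STATUS_INSUFFICIENT_RESOURCES;
    }

    /* Every word moved here is a CPU load that becomes a PCI target read;
     * small accesses mean small bursts and low throughput. */
    VOID CopyFromCard(PVOID Destination, ULONG ByteOffset, ULONG ByteCount)
    {
        READ_REGISTER_BUFFER_ULONG((PULONG)(CardBase + ByteOffset),
                                   (PULONG)Destination,
                                   ByteCount / sizeof(ULONG));
    }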

I think a much better way to view the problem is: how fast do you need to go
to make your device work correctly, and then what PCI implementation and
parameters will it take to achieve that? Also keep in mind that just
because your device's PCI interface has sufficient performance, don't
automatically assume other parts of the OS/application will not be a
bottleneck.

- Jan



> Other factors include the code you use to access your device, I assume in
> PCI target mode. Unless your device has its own bus master support, only
> PCI target mode will be available. Generally, this means all transfers go
> through the CPU.

DMA transfers don't involve the CPU, correct? And what does 'PCI target mode'
mean?

> I think a much better way to view the problem is: how fast do you need to go
> to make your device work correctly, and then what PCI implementation and
> parameters will it take to achieve that? Also keep in mind that just
> because your device's PCI interface has sufficient performance, don't
> automatically assume other parts of the OS/application will not be a
> bottleneck.

The purpose of the card is to host an emulator, and the performance is
needed to set probes in the design (store the value of a certain point, every
cycle). Since the card's RAM is limited, the FPGA stores data in the RAM and
the driver should move it to the host's memory. The idea is to find the
maximum number of probes I can set with no (or minimal) degradation of
emulation performance. Since it runs at 1 MHz, I need to move 4 MB per second
(32 bits per probe) times the number of probes.
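To spell out that arithmetic, here is a quick sanity check; the ~120 MBytes/sec usable-bandwidth figure is just the optimistic number from the earlier reply, not a measurement:

    /* Rough probe-bandwidth estimate: one 32-bit sample per probe per cycle at
     * 1 MHz, compared against ~120 MBytes/sec of usable 32-bit PCI bandwidth
     * (an optimistic assumption, not a measurement). */
    #include <stdio.h>

    int main(void)
    {
        const double cycle_rate_hz   = 1e6;    /* emulator runs at 1 MHz      */
        const double bytes_per_probe = 4.0;    /* 32 bits per probe per cycle */
        const double usable_pci_bw   = 120e6;  /* bytes/sec, optimistic       */

        for (int probes = 1; probes <= 32; probes *= 2) {
            double needed = cycle_rate_hz * bytes_per_probe * probes;
            printf("%2d probes -> %5.0f MB/s (%s)\n",
                   probes, needed / 1e6,
                   needed <= usable_pci_bw ? "fits" : "exceeds 32-bit PCI");
        }
        return 0;
    }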
Since I'm not familiar with PCI programming, where can I get more
information about this kind of PCI implementation, its parameters, etc.? Is
there any sample driver for this purpose?

Thanks,
Aaron



> DMA transfers don't involve the CPU, correct? And what does 'PCI target
> mode' mean?

A PCI target typically means something that looks like memory (or I/O
ports) mapped into the host's address space. See the PCI specs for the
picky definition. The host processor basically does memory moves to/from
the PCI card address space from/to normal memory address space. A PCI
master basically does its own transfers to/from main memory. The CPU just
tells the bus master controller where and how big. Fancy bus master
controllers can do things like walk data structures in main memory,
executing custom commands and understanding/updating memory data
structures. A bus master typically interrupts the CPU when some significant
chunk of work is finished.
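As a rough illustration of "the CPU just tells the bus master controller where and how big", a hedged sketch of programming a hypothetical bus-master DMA engine from a driver; the register offsets are invented for the example, since every device defines its own:

    /* Sketch of kicking off one card-to-host bus-master transfer. The register
     * layout is invented for illustration; DmaRegs is the mapped BAR holding
     * the device's DMA control registers. A real driver would obtain the
     * device-visible host address through the DMA adapter (common buffer or
     * map registers) rather than using a raw physical address. */
    #include <wdm.h>

    #define DMA_HOST_ADDR_LO  0x00  /* hypothetical: low 32 bits of host address */
    #define DMA_LENGTH        0x08  /* hypothetical: transfer length in bytes    */
    #define DMA_CONTROL       0x0C  /* hypothetical: bit 0 = start               */

    VOID StartCardToHostDma(PUCHAR DmaRegs, ULONG DeviceVisibleAddr, ULONG Length)
    {
        WRITE_REGISTER_ULONG((PULONG)(DmaRegs + DMA_HOST_ADDR_LO), DeviceVisibleAddr);
        WRITE_REGISTER_ULONG((PULONG)(DmaRegs + DMA_LENGTH), Length);
        WRITE_REGISTER_ULONG((PULONG)(DmaRegs + DMA_CONTROL), 0x1);   /* go */
        /* From here the card moves the data itself and raises an interrupt when
         * the transfer (or an error) completes; the CPU is out of the loop. */
    }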

PCI bus masters do NOT need to transfer data through the CPU, so you could
be moving 120 MBytes/sec at close to 0% CPU usage. The contention for memory
bandwidth can slow down the CPU, depending on available bandwidth. For
example, a Pentium 4 with dual RAMBUS channels has LOTS more memory
bandwidth than an older system using EDO RAM.

> Since I'm not familiar with PCI programming, where can I get more
> information about this kind of PCI implementation, parameters, etc. Is there
> any sample driver for this purpose?

I'd suggest looking at the source for the PCI implementation. Is this built
into the FPGA you're using? Or a MacroCell perhaps? Or possibly an extra PCI
interface chip? Whoever created the PCI interface implementation should
know lots about its characteristics.

The very simplest driver interface to a target PCI device would probably be
to map the target memory spaces into application memory. For your
application, a question is how an application knows how much data has
been written into the device memory. Is there a count field that's also
updated in memory? Also, how is synchronization between reading the buffer
and writing it handled? Periodically the host will read some memory,
freeing space; how is this communicated back to the FPGA logic?
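One possible way to answer those questions, sketched here purely as an assumption rather than a known feature of Aaron's hardware, is a small control block at a fixed offset in card RAM holding producer/consumer indices that both sides can read and write in target mode:

    /* Hypothetical control block at a fixed offset in card RAM. The FPGA
     * advances WriteIndex as it stores samples; the driver advances ReadIndex
     * as it drains them. Indices are in 32-bit words and wrap modulo
     * BufferWords. All field names are invented for the example. */
    #include <wdm.h>

    typedef struct _CARD_RING_HEADER {
        ULONG BufferWords;   /* size of the data area, in words (read-only)   */
        ULONG WriteIndex;    /* next word the FPGA will write (FPGA-owned)    */
        ULONG ReadIndex;     /* next word the driver will read (driver-owned) */
        ULONG OverflowFlag;  /* set by the FPGA if it ever catches ReadIndex  */
    } CARD_RING_HEADER;

    /* Words currently available to the driver, handling wrap-around. */
    static ULONG WordsAvailable(const CARD_RING_HEADER *Hdr)
    {
        return (Hdr->WriteIndex + Hdr->BufferWords - Hdr->ReadIndex) % Hdr->BufferWords;
    }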

- Jan


> A PCI target typically means something that looks like memory (or I/O
> ports) mapped into the host's address space. See the PCI specs for the
> picky definition. The host processor basically does memory moves to/from
> the PCI card address space from/to normal memory address space. A PCI
> master basically does its own transfers to/from main memory. The CPU just
> tells the bus master controller where and how big. Fancy bus master
> controllers can do things like walk data structures in main memory,
> executing custom commands and understanding/updating memory data
> structures. A bus master typically interrupts the CPU when some significant
> chunk of work is finished.

Initially you say "the host processor basically does memory moves…", then "a
PCI master basically does its own transfers", so how do you tell the
difference? I know how to make the CPU do the job, but how do you tell the
PCI master to make the transfers? And (just to be sure I understand), the PCI
master is the PCI controller on the host, right?
For the second case, how does the flow go: a new IRP triggers the PCI
transfers, and then waits for an interrupt to do whatever is needed after the
data is in place?

> I'd suggest looking at the source for the PCI implementation. Is this built
> into the FPGA you're using? Or a MacroCell perhaps? Or possibly an extra PCI
> interface chip? Whoever created the PCI interface implementation should
> know lots about its characteristics.

I plan to use a PCI macro cell from Xilinx. So far I have just seen the
marketing blabla, but they also promise a 130 MB/s max throughput.

> The very simplest driver interface to a target PCI device would probably be
> to map the target memory spaces into application memory. For your
> application, a question is how an application knows how much data has
> been written into the device memory. Is there a count field that's also
> updated in memory? Also, how is synchronization between reading the buffer
> and writing it handled? Periodically the host will read some memory,
> freeing space; how is this communicated back to the FPGA logic?

What I plan to do is something like this:

  • the FPGA writes data to memory
  • near the middle of the buffer it interrupts, notifying that data is available
  • the device driver gets the relevant info from device registers and moves the
    data (by whatever means)
  • at move completion, the driver sets the relevant info in device registers
  • during the whole transaction, the FPGA keeps writing, unless it would
    overwrite data not moved yet

Does it make sense? Do you know of a sample driver that does a similar task?

Thanks,
Aaron



> Initially you say "the host processor basically does memory moves…", then
> "a PCI master basically does its own transfers", so how do you tell the
> difference?

Your PCI hardware has to have logic to implement bus mastering if that's
what you want. Simpler PCI interface hardware only has a target interface.
You should look at the macro cell specs to determine what its capabilities
are.

> I know how to make the CPU do the job, but how do you tell the PCI master
> to make the transfers? And (just to be sure I understand), the PCI master
> is the PCI controller on the host, right?

All data transfers across a PCI bus have a master and a target. The most
common master on PCs is the x86 CPU working with its bus interface
chipset. The CPU decides what addresses to present to the PCI bus, and what
operation (read/write). To be squeaky correct, actual PCI transfers have a
lot more details than just a read or a write.

A PCI transaction starts with some master requesting use of the bus. An
arbitration occurs to grant exclusive use of the bus signals. The host PCI
chipset has this arbitration logic in it. Once bus ownership is granted, a
master presents the desired starting address of a burst on the bus. A
target device recognizes the address and claims it. Only one target can
claim a transaction. To claim the transaction, the target raises an
appropriate signal on the bus, which moves the transaction into the data
phase (and often some wait states). In the data phase, the master and
target present data, and toggle PCI signals to strobe data values across
the bus (the PCI clock is really the strobe). Hopefully, you transfer lots
of data values on each PCI transaction. The master and target then release
the bus when they are done (this is a bit simplified). Any master is now
free to arbitrate a new transfer.

The CPU support chipset on a PCI bus is both a PCI master and target. It
responds to addresses, usually in the host's physical memory range, as a
target. It can also initiate master transactions.

A device card can also be a target, master, or both. A target device will
only respond to other masters. If the device is a master, it will typically
respond at some control register addresses as a target, to set parameters
for master transfers.

If your hardware has PCI mastering support, your driver has to program the
device to do the bus master transfers. Once the device is programmed with
its bus master parameters, the host CPU is finished with the transfer,
until the device interrupts with status.

A device can try to become the bus master at any time, interleaving its
burst transfers with those of other masters (like the host CPU) based on
arbitration. Master devices typically round-robin bus ownership.

> For the second case, how does the flow go: a new IRP triggers the PCI
> transfers, and then waits for an interrupt to do whatever is needed after
> the data is in place?

If the device is bus master capable, the flow would be: an IRP passes into
the driver and, assuming the device is not busy, the driver programs the
device's bus master registers with a physical address and length. It then
marks the IRP as pending and returns to the OS. Sometime later, the device
decides it's done with the transfer and signals an interrupt to the host CPU.
The interrupt service routine decides if it's the correct device interrupting
(PCI buses share interrupts among multiple devices) and typically queues a
DPC. Eventually, the DPC will start running, and the driver fills in status
from the device and marks the IRP as complete.
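A hedged sketch of that ISR/DPC completion path in WDM terms; the device extension fields, the interrupt-status register offset, and the single-outstanding-IRP bookkeeping are placeholders for whatever the real hardware and driver define:

    /* Sketch of the ISR -> DPC -> IoCompleteRequest path for a bus-master
     * device. The device extension layout and the interrupt-status register
     * are invented placeholders; the DPC is assumed to have been set up with
     * KeInitializeDpc(&ext->TransferDpc, TransferDpc, ext) at start-device. */
    #include <wdm.h>

    typedef struct _DEVICE_EXTENSION {
        PUCHAR Regs;          /* mapped control registers                  */
        PIRP   CurrentIrp;    /* IRP whose transfer the card is working on */
        KDPC   TransferDpc;
        ULONG  LastStatus;
    } DEVICE_EXTENSION, *PDEVICE_EXTENSION;

    #define REG_INT_STATUS 0x10    /* hypothetical interrupt status register */

    BOOLEAN TransferIsr(PKINTERRUPT Interrupt, PVOID Context)
    {
        PDEVICE_EXTENSION ext = (PDEVICE_EXTENSION)Context;
        ULONG status = READ_REGISTER_ULONG((PULONG)(ext->Regs + REG_INT_STATUS));

        UNREFERENCED_PARAMETER(Interrupt);
        if (status == 0) {
            return FALSE;    /* not our device; PCI interrupt lines are shared */
        }
        WRITE_REGISTER_ULONG((PULONG)(ext->Regs + REG_INT_STATUS), status); /* ack */
        ext->LastStatus = status;
        KeInsertQueueDpc(&ext->TransferDpc, NULL, NULL);  /* defer the real work */
        return TRUE;
    }

    VOID TransferDpc(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        PDEVICE_EXTENSION ext = (PDEVICE_EXTENSION)Context;
        PIRP irp = ext->CurrentIrp;

        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);

        if (irp == NULL) {
            return;
        }
        ext->CurrentIrp = NULL;
        irp->IoStatus.Status = STATUS_SUCCESS;
        irp->IoStatus.Information = 0;   /* a real driver reports the byte count */
        IoCompleteRequest(irp, IO_NO_INCREMENT);
    }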

For a target device, the IRP will come in and typically be processed
synchronously by reading/writing the device memory as a target (it's just
some addresses). To be squeaky clean again, this IRP may actually have to
be queued to make PnP happy. After the data is transferred, the IRP is
completed. Target-mode drivers are often simpler than bus-master drivers.

> I plan to use a PCI macro cell from Xilinx. So far I have just seen the
> marketing blabla, but they also promise a 130 MB/s max throughput.

PCI marketing literature almost ALWAYS says that. Sometimes they are
telling the truth, sometimes not. It depends on a lot of factors.

> What I plan to do is something like this:
>
>   • the FPGA writes data to memory
>   • near the middle of the buffer it interrupts, notifying that data is available
>   • the device driver gets the relevant info from device registers and moves the
>     data (by whatever means)
>   • at move completion, the driver sets the relevant info in device registers
>   • during the whole transaction, the FPGA keeps writing, unless it would
>     overwrite data not moved yet

And how does the driver EXACTLY synchronize with the byte count? It seems like:

  • the FPGA will be writing to memory, updating a count register on every byte
  • at 50% buffer full, it requests an interrupt
  • somewhere between a few microseconds and a bunch of milliseconds later,
    the CPU will respond to the interrupt and queue the DPC
  • sometime later, the DPC will start running and, assuming target access
    mode, will move data from addresses 0 to 50% of the buffer to a buffer
    hopefully passed down by an application
  • the FPGA will keep writing to the buffer, and incrementing the count
  • the CPU will finish its memory move
  • at this point there is new data from the buffer 50% mark to someplace higher
  • how does the driver now inform the FPGA write logic that 0-50% of the
    buffer is free, and that the stuff it wrote since the interrupt request is
    still there?
  • the driver returns the buffer full of data to the application

You could set up the target interface to look like a FIFO, with a status
bit to tell if ANY data is in the FIFO. The driver can then poll the status
bit, and read a byte if it's available. Note this will cause small PCI
bursts (like 1 byte) and cause your transfer rate to be a small number of
MBytes/sec.
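A minimal sketch of that status-bit polling approach, with invented FIFO_STATUS/FIFO_DATA register offsets; every loop iteration is its own one-word PCI target read, which is exactly what keeps the throughput down in the single MBytes/sec range:

    /* Polled FIFO drain: each READ_REGISTER_ULONG is its own tiny PCI burst,
     * which is why this tops out at a few MBytes/sec. Register offsets are
     * hypothetical. */
    #include <wdm.h>

    #define FIFO_STATUS 0x00   /* bit 0 = data available       */
    #define FIFO_DATA   0x04   /* reading pops one 32-bit word */

    ULONG DrainFifo(PUCHAR Regs, PULONG Dest, ULONG MaxWords)
    {
        ULONG copied = 0;

        while (copied < MaxWords &&
               (READ_REGISTER_ULONG((PULONG)(Regs + FIFO_STATUS)) & 1)) {
            Dest[copied++] = READ_REGISTER_ULONG((PULONG)(Regs + FIFO_DATA));
        }
        return copied;   /* number of 32-bit words actually read */
    }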

You could also implement a FIFO as a buffer plus read/write pointers. The
driver keeps unloading the buffer, moving up the read pointer. The FPGA
keeps writing at the write pointer, unless it's the same as the read
pointer, in which case you have buffer overflow (which you might want to
signal in another bit). It's usually appropriate for drivers to know when
data is lost. This strategy would allow you to do much larger burst
transfers, as you don't have to poll the status on every byte. You have a
few strategies for knowing when to unload the FIFO. One is to just have a
timer fire in the driver and look to see if there is any new data. If
the buffer is large, and you were not expecting interrupts that often
anyway, this is a fine way to do things. It also forces the hardware
designer to think about what happens under slow interrupt latency. Hardware
designers sometimes expect maximum latency to be less than it really is,
and then you have a broken hardware design that can't be made to work
reliably. This approach also simplifies the driver and probably the
hardware, as no interrupts are needed. If minimizing the latency from when
the data is captured to when an application can see it is important, then
it's not a good design, as you might have to wait for the polling delay to
get one byte. Think in terms of polling delays of tens of milliseconds,
which, if you're producing 100 MBytes/sec of new data, means your buffers
had better be multiple MBytes in size (rather larger than on-chip FPGA
memory).
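A hedged sketch of the pointer-based FIFO from the driver's side; the register offsets and the word-indexed layout are assumptions, and the routine drains only the contiguous run up to the wrap point (a second call picks up the rest):

    /* Timer- or poll-driven drain of a pointer-based FIFO living in card RAM.
     * Register offsets and layout are assumptions; CardData is the mapped data
     * area, RingWords its size in 32-bit words. */
    #include <wdm.h>

    #define REG_WRITE_INDEX 0x20  /* hypothetical: FPGA's producer index (words)   */
    #define REG_READ_INDEX  0x24  /* hypothetical: driver's consumer index (words) */
    #define REG_OVERFLOW    0x28  /* hypothetical: nonzero if the FPGA overran us  */

    ULONG DrainRing(PUCHAR Regs, PULONG CardData, ULONG RingWords,
                    PULONG Dest, BOOLEAN *Overflowed)
    {
        ULONG wr = READ_REGISTER_ULONG((PULONG)(Regs + REG_WRITE_INDEX));
        ULONG rd = READ_REGISTER_ULONG((PULONG)(Regs + REG_READ_INDEX));
        ULONG copied = 0;

        *Overflowed = (READ_REGISTER_ULONG((PULONG)(Regs + REG_OVERFLOW)) != 0);

        if (rd != wr) {
            /* One large copy instead of a status poll per byte. */
            ULONG run = (wr > rd) ? (wr - rd) : (RingWords - rd);
            READ_REGISTER_BUFFER_ULONG(&CardData[rd], Dest, run);
            copied = run;
            rd = (rd + run) % RingWords;
        }

        /* Publish the new read index so the FPGA knows the space is free again. */
        WRITE_REGISTER_ULONG((PULONG)(Regs + REG_READ_INDEX), rd);
        return copied;
    }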

As your specific app may not be for general public consumption, you could
also use a strategy of just constantly polling, completing the IRP if there
is no data. Polling intervals may be rather less than many milliseconds.
You're still not guaranteed a maximum latency, so big buffers may still be
required to guarantee correct operation.

If your data rate is high, and your available buffer space is small, you
don't have much choice but to use bus mastering support. In theory, PCI bus
masters can have guaranteed bus access latency. So if your buffer is 1000
bytes, and you're generating 100 MBytes/sec, you will need access to the bus
real fast and often (perhaps impossibly often). Also keep in mind the
difference between AVERAGE transfer rates across the PCI bus and MINIMUM
rates. For example, if you're generating data at 50 MBytes/sec, your buffers
have to be large enough to handle the longest latency you might encounter.
On average, you might be able to transfer 100 MBytes/sec, but if your
buffer is only 1000 bytes, you absolutely MUST transfer 1000 bytes each
and every 20 microseconds. You will NOT be able to request an interrupt
and be assured your bus mastering hardware gets programmed in 20
microseconds. The result may be that your hardware, which absolutely can
transfer an average of 100 MBytes/sec, will lose data at only 50
MBytes/sec. Drawing a time line of activity can often help a lot.
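To put the buffer-sizing rule in one line: the buffer must cover the worst-case service latency at the incoming data rate. A tiny check with assumed numbers (50 MBytes/sec and a 10 ms worst case are examples, not measurements):

    /* Worst-case buffer sizing: buffer_bytes >= data_rate * max_service_latency.
     * The rate and latency below are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const double data_rate_bps   = 50e6;   /* 50 MBytes/sec of new data      */
        const double worst_latency_s = 10e-3;  /* 10 ms until the driver unloads */

        double min_buffer = data_rate_bps * worst_latency_s;
        printf("minimum buffer: %.0f bytes (~%.0f KB)\n",
               min_buffer, min_buffer / 1024.0);
        /* 50e6 * 0.01 = 500,000 bytes, far more than typical on-chip FPGA RAM. */
        return 0;
    }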

A solution is to have BIG hardware buffers if your data rates are high.

There are sample PCI drivers for both masters and targets in the DDK.

I hopefully have pointed out some of the deep holes it's easy to fall into
when designing PCI hardware. It's better to understand the issues up front
than to have to discover them a few weeks before you want to be shipping a
product, and find it drops data. No amount of driver magic can fix many of
these problems. As a driver writer, I don't look forward to delivering the
news that a hardware design is broken.

- Jan
