PCI max rate

Hello there,

My question is about the maximum rate over the PCI bus with the PLX 9054.

The architecture I use is “Common Buffer DMA”. For that, I preallocate
several 1 MB buffers in system RAM with the help of the driver (a
simplified sketch of the allocation follows the list below).

The sequence of the actions is as follows:

  1. The adapter writes the data to the DDRs via the PCI initiator
    mechanism (32-DWORD transfers).
  2. Once a buffer is full, the data are transferred from the DDRs to
    the hard disk with the help of the driver and the software.
  3. CPU usage is 0 - 1 %, as it should be because of the DMA
    transactions.
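Roughly, the preallocation is along the lines of the standard WDM
common-buffer mechanism; this is a simplified sketch, not my actual
code, and the names are illustrative:

#include <wdm.h>

#define DMA_BUFFER_SIZE (1024 * 1024)   /* one 1 MB common buffer */

/* Simplified: allocate one common buffer the adapter can bus-master
 * into.  'Pdo' is the physical device object passed to AddDevice. */
NTSTATUS AllocateDmaBuffer(PDEVICE_OBJECT Pdo,
                           PDMA_ADAPTER *DmaAdapter,
                           PVOID *Va,
                           PHYSICAL_ADDRESS *BusAddr)
{
    DEVICE_DESCRIPTION dd;
    ULONG mapRegisters;

    RtlZeroMemory(&dd, sizeof(dd));
    dd.Version           = DEVICE_DESCRIPTION_VERSION;
    dd.Master            = TRUE;        /* the PLX 9054 is a bus master */
    dd.ScatterGather     = FALSE;       /* common-buffer DMA, not SG    */
    dd.Dma32BitAddresses = TRUE;
    dd.InterfaceType     = PCIBus;
    dd.MaximumLength     = DMA_BUFFER_SIZE;

    *DmaAdapter = IoGetDmaAdapter(Pdo, &dd, &mapRegisters);
    if (*DmaAdapter == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    /* BusAddr is what gets programmed into the adapter so it can write
     * incoming data straight into this buffer. */
    *Va = (*DmaAdapter)->DmaOperations->AllocateCommonBuffer(
              *DmaAdapter, DMA_BUFFER_SIZE, BusAddr, TRUE /* cached */);

    return (*Va != NULL) ? STATUS_SUCCESS : STATUS_INSUFFICIENT_RESOURCES;
}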

The maximum rate we achieved during early tests was 120 Mbits/sec.
With some modifications to PLX registers (PCIMGR set to 6, PCIMLR set
to A) we managed to increase the rate to 150 Mbits/sec.

In theory the PCI bus can transfer 1 Gbit/sec.
Is this the maximum rate we should expect, given that the PCI bus is
also kept busy by the hard disk at the same time?
Is there something I can do to boost the performance?

Thank you in advance.

Stylianides Nikolas.

nikolas stylianides wrote:

The maximum rate we achieved during early tests was 120 Mbits/sec.
With some modifications to PLX registers (PCIMGR set to 6, PCIMLR set
to A) we managed to increase the rate to 150 Mbits/sec.

In theory the PCI bus can transfer 1 Gbit/sec.
Is this the maximum rate we should expect, given that the PCI bus is
also kept busy by the hard disk at the same time?
Is there something I can do to boost the performance?

How are you measuring performance? Are you including the hard disk
time? What kind of hard disk? Many lower-end hard disks can’t sustain
20 MB/s. Have you tried skipping the hard-disk write to see what kind
of raw performance you get from the DMA transfer? I don’t know what
good that information would do you, but it might point to the bottleneck.
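If you want a quick way to see that raw number, something along these
lines would do. WaitForFullBuffer() here is just a stand-in for however
your driver tells the application that a buffer has filled (an event, a
pended IOCTL, whatever):

#include <windows.h>
#include <stdio.h>

#define BUFFER_BYTES (1024 * 1024)      /* the 1 MB common buffers */

/* Placeholder: block until the driver reports one full buffer.  In the
 * real application this is an event / DeviceIoControl completion /
 * whatever mechanism the driver exposes. */
extern BOOL WaitForFullBuffer(void);

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    unsigned long long buffers = 0;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    /* Consume buffers for ~10 seconds, but skip the WriteFile to disk. */
    for (;;) {
        if (!WaitForFullBuffer())
            break;
        buffers++;
        QueryPerformanceCounter(&t1);
        double secs = (double)(t1.QuadPart - t0.QuadPart) / freq.QuadPart;
        if (secs >= 10.0) {
            double mbytes = buffers * (double)BUFFER_BYTES / 1e6;
            printf("%.1f MB/s (%.0f Mbps) raw DMA rate\n",
                   mbytes / secs, mbytes * 8.0 / secs);
            break;
        }
    }
    return 0;
}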


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> -----Original Message-----
> From: xxxxx@lists.osr.com On Behalf Of nikolas stylianides

The maximum rate we achieved during early tests was 120 Mbits/sec.
With some modifications to PLX registers (PCIMGR set to 6, PCIMLR set
to A) we managed to increase the rate to 150 Mbits/sec.

How is this number measured (at the bus level or at the application
level)?

I’m not familiar with the PLX 9054 and the terms they use in their
manual. I’m assuming PCIMGR refers to “Minimum Grant” and PCIMLR refers
to “Maximum Latency”, as defined by “MIN_GNT” and “MAX_LAT” in the PCI
Local Bus spec. I further assume the device is operating in
33 MHz/32-bit PCI mode.

Note that the following discussion applies only to the PCI bus (as
opposed to PCI-X or PCI Express).

With some modifications to PLX registers (PCIMGR set to 6, PCIMLR set
to A)

Basically, you are setting the length of the data phase in a single
burst to 6 * 250 ns == roughly 48 PCI clocks at 33 MHz, and telling the
system your device will want the bus every 10 * 250 ns. How did you
choose these values? How big is your DMA FIFO? How fast can the device
fill the DMA write FIFO? How fast can the target (the host bridge in
your case) sink the data from the master?
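Just to put numbers on the unit conversion, here is some plain-C
arithmetic. The 250 ns unit is how the PCI spec defines MIN_GNT and
MAX_LAT; the 48-clock figure above comes from rounding one 250 ns unit
to 8 clocks:

#include <stdio.h>

/* MIN_GNT and MAX_LAT are both expressed in units of 250 ns (PCI Local
 * Bus spec).  At 33 MHz the PCI clock period is about 30.3 ns, so one
 * 250 ns unit is roughly 8 clocks. */
static void show_latency(unsigned min_gnt, unsigned max_lat)
{
    const double unit_ns = 250.0;
    const double clk_ns  = 1000.0 / 33.0;   /* ~30.3 ns per PCI clock */

    printf("MIN_GNT=%u -> device wants bursts of %.0f ns (~%.0f clocks)\n",
           min_gnt, min_gnt * unit_ns, min_gnt * unit_ns / clk_ns);
    printf("MAX_LAT=%u -> device wants the bus every %.0f ns (~%.0f clocks)\n",
           max_lat, max_lat * unit_ns, max_lat * unit_ns / clk_ns);
}

int main(void)
{
    show_latency(6, 0xA);   /* the values from the post: 6 and 0xA (10) */
    return 0;
}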

These values must be chosen to be bus friendly, such that:

  1. you don’t overrun your DMA FIFO because the target can’t sink the
    data fast enough and has to insert WAIT states, or issue a target
    retry or disconnect;
  2. you don’t underrun your DMA FIFO because your device is not feeding
    the FIFO fast enough, forcing the master to insert wait states
    (since you may be overbooking the bus cycles);
  3. the master doesn’t request the bus until it has enough data in the
    FIFO to sustain a long burst.

However, these are your goals; how the system arbiter actually handles
it may vary from platform to platform. You’ve been warned.

You didn’t say what the Latency Timer value of your device is. Note that
the MIN_GNT and MAX_LAT settings should not conflict with the Latency
Timer.

In theory the PCI bus can transfer 1 Gbit/sec.

On a 33 MHz/32-bit bus, the theoretical max bandwidth is
133 Mbytes/sec == 1064 Mbits/sec. You get this number only if the bus
carries nothing but data phases: no handshaking, no termination, nothing
else. That is of course impossible, because every transaction has
overhead. The achievable rate depends on a number of factors, including
but not limited to the following (a rough worked example of the impact
follows the list):

  1. arbitration latency – the clocks between the master asserting REQ#
    and its GNT# being asserted.
  2. master data latency – the clocks from the master asserting FRAME#
    to it asserting IRDY#.
  3. target initial latency – the clocks from the master asserting
    FRAME# to the target claiming or terminating the first data phase.
  4. target subsequent latency – a bit complicated to explain without a
    whiteboard.
  5. DEVSEL# timing profile – how many clocks after the master drives
    FRAME# the target starts driving DEVSEL#.
  6. WAIT states inserted by either master or target for any reason.
  7. the PCI commands being used during a burst.
  8. mean size of DMA bursts – a longer burst can compensate for the
    overhead, but hurts latency.
  9. cacheline size and the DMA boundary.
  10. the way transactions are terminated (by master or target; if by
    target, how it terminates: Disconnect, Retry, or something else),
    and so on and so on…
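To make that concrete, here is some throwaway arithmetic (plain C; the
8-clock per-burst overhead is an invented example, substitute whatever
your bus analyzer actually shows):

#include <stdio.h>

/* Rough effective-throughput estimate for a 33 MHz / 32-bit PCI burst.
 * The overhead count below is a made-up example, not a measurement. */
int main(void)
{
    const double clk_hz = 33.333e6;          /* PCI clock                         */
    const double bytes_per_data_clk = 4.0;   /* 32-bit bus: one DWORD per phase   */
    const unsigned data_phases = 32;         /* the 32-DWORD bursts from the post */

    /* Assumed per-burst overhead: arbitration + address phase + DEVSEL
     * + bus turnaround.  Tune this to what you actually observe. */
    const unsigned overhead_clks = 8;

    double efficiency = (double)data_phases / (data_phases + overhead_clks);
    double peak_mbs   = clk_hz * bytes_per_data_clk / 1e6;   /* ~133 MB/s */
    double eff_mbs    = peak_mbs * efficiency;

    printf("peak %.0f MB/s, ~%.0f%% efficient -> ~%.0f MB/s (%.0f Mbps)\n",
           peak_mbs, efficiency * 100.0, eff_mbs, eff_mbs * 8.0);
    return 0;
}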

It also depends on what the other master(s) on the bus are doing. PCI is
inherently a time-shared, half-duplex bus. For example, I have a
full-duplex Gigabit NIC sending a lot of packets, but at the same time
the NIC’s DMA write engine cannot move received packets into host memory
until the DMA read engine has yielded the bus. Bad, eh?

Is this the maximum rate we should expect, given that the PCI bus is
also kept busy by the hard disk at the same time?

Once you figure out the overhead that you can’t avoid, you pretty much
know how much to expect. A Gigabit NIC in a regular PCI slot delivering
about 700 Mbps of TCP send traffic is considered a pretty good number;
it has already saturated the PCI bus bandwidth limit once you take into
account the RX DMA traffic, the PCI accesses that update the hardware
ring indices, and moving the SGLs across the bus. BTW, GbE requires
2 Gbps+ of bidirectional bandwidth at the bus level.
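To show where the 2 Gbps+ figure comes from, a throwaway budget (the
descriptor/SGL overhead number is a guess, not a measurement):

#include <stdio.h>

/* Back-of-the-envelope bus-bandwidth budget for a full-duplex GbE NIC
 * on 33 MHz/32-bit PCI. */
int main(void)
{
    const double tx_payload_mbps    = 1000.0; /* wire-rate transmit data          */
    const double rx_payload_mbps    = 1000.0; /* wire-rate receive data           */
    const double desc_overhead_mbps = 100.0;  /* ring indices, SGLs, etc. (guess) */

    const double needed        = tx_payload_mbps + rx_payload_mbps
                               + desc_overhead_mbps;
    const double pci_peak_mbps = 133.0 * 8.0; /* ~1064 Mbps theoretical peak */

    printf("needed ~%.0f Mbps vs. PCI peak ~%.0f Mbps -> the bus is the limit\n",
           needed, pci_peak_mbps);
    return 0;
}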

Is there something I can do to boost the performance?

You are asking a tough question here :)
Find out where the bottleneck is first. It could be in hardware, in
software, or in the explicit interactions between software components
and the hardware (a classic headache for NIC drivers). You can start by
hooking up a bus analyzer to watch your DMA. Also look into the driver
for hot spots. Understanding the behavior of the application is very
important. You will need to make a trade-off between throughput and
latency at some point. In general, with a general-purpose PCI controller
such as the 9054, you may not be able to achieve the same performance as
a dedicated ASIC (like an Ethernet controller, FC HBA, etc.), because
their DMA engines and internal state machines/blocks are tightly coupled
and their requests are often pipelined, so they can do very aggressive
DMA at the bus level. But 150 Mbps seems a little low, so you should
have room.

In the end, performance tuning is very interesting and *frustrating*
work. You might get a boost in some applications but break others,
depending on how complex your hardware and driver are and on the
application that uses your hardware.

Good luck!

Calvin Guan (DDK MVP) Sr. Staff Engineer
NetXtreme Vista/Longhorn Server Miniport
Broadcom Corporation, Irvine CA 92618
Connecting Everything(r)

> However, these are your goals; how the system arbiter handles it may
> vary from platform to platform. You’ve been warned.

Usually arbitration is more-or-less fair and is done using a variant of a
round-robin algorithm. Still, I’ve seen really unfair arbitration done by a
specific PCI chipset: it looked like PCI slots had different fixed priorities
that depended only on the slot numbers.

Dmitriy Budko
VMware

Some chipsets may just ignore MIN_GNT, MAX_LAT, and the Latency Timer.

Calvin Guan (DDK MVP) Sr. Staff Engineer
NetXtreme Vista/Longhorn Server Miniport
Broadcom Corporation, Irvine CA 92618
Connecting Everything(r)


Thank you for your answers, especially those given by Calvin Guan.
I will look into this and get back to you.

Again thank you very much.

I have checked it out following the directions provided by Calvin Guan.
Calvin, you are right: the max rate was roughly what it should be.

Thank you very much.

Nikolas Stylianides.