Scatter gather DMA for huge data size

Hello,

We are starting to develop a PCIe card that will have to read 16GB from RAM into a FIFO in the FPGA.
16GB can be spread across up to 4194304 4KB pages.
So we will have to build a list that contains 4194304 descriptors.
This can take time.
My question: Is this the right approach?
Until now, we have used only physically contiguous RAM.
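
For reference, the page count in the question can be checked with a short calculation. This is only a sketch: the 16-byte descriptor size is an assumption for illustration (real DMA engines define their own descriptor layout), and it takes the worst case of one descriptor per 4KB page, with no physically adjacent pages merged.

```python
TOTAL_BYTES = 16 * 1024**3      # 16GB buffer, from the question
PAGE_SIZE = 4 * 1024            # 4KB pages, from the question
DESCRIPTOR_BYTES = 16           # ASSUMPTION: hypothetical descriptor size


def descriptor_count(total_bytes, page_size):
    """Worst case: one descriptor per page (ceiling division)."""
    return -(-total_bytes // page_size)


def descriptor_table_bytes(total_bytes, page_size, descriptor_bytes):
    """Memory needed to hold the whole descriptor list."""
    return descriptor_count(total_bytes, page_size) * descriptor_bytes


count = descriptor_count(TOTAL_BYTES, PAGE_SIZE)
table = descriptor_table_bytes(TOTAL_BYTES, PAGE_SIZE, DESCRIPTOR_BYTES)
print(f"{count} descriptors, {table / 1024**2:.0f} MiB of descriptor table")
```

If the same 16GB is sent over and over, the list only has to be built once, and any runs of physically adjacent pages can share a descriptor, so the real count may be lower.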

Thank you,
Zvika

How long will it take to collect that data? What are you doing with the data? I've done several telemetry projects with continuous DMA where we use a circular buffer. You just need to make sure your processing (user mode?) can consume the data before it wraps. And as long as that's true, the buffer can be much smaller.
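
To make the "consume before it wraps" condition concrete, here is a minimal model of a circular buffer with overrun detection. It is plain bookkeeping, not driver code: the class and method names are made up for illustration, and a real driver would track these positions through hardware registers or completion interrupts.

```python
class RingBuffer:
    """Byte-count model of a circular DMA buffer (no actual data is stored)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.written = 0   # total bytes produced so far (e.g. by the DMA engine)
        self.read = 0      # total bytes consumed so far (e.g. by user mode)

    def unread(self):
        return self.written - self.read

    def free_space(self):
        return self.capacity - self.unread()

    def produce(self, n):
        # The producer must never lap the consumer, or unread data is lost.
        if n > self.free_space():
            raise OverflowError("producer would overwrite unread data")
        self.written += n

    def consume(self, n):
        # Consume at most what is available; return the amount actually taken.
        n = min(n, self.unread())
        self.read += n
        return n

    def write_offset(self):
        # Physical position inside the buffer wraps modulo the capacity.
        return self.written % self.capacity


ring = RingBuffer(capacity=8)
ring.produce(6)
ring.consume(4)
ring.produce(6)          # fits only because 4 bytes were consumed first
print(ring.free_space(), ring.write_offset())
```

As long as the consumer keeps `free_space()` above the size of the next write, the buffer can be far smaller than the total amount of data moved.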

I would lean towards what Tim said and use a smaller circular contiguous buffer.

If you really do want a large buffer, though: I previously worked for a company that had an FPGA card that captured very large amounts of data. They used bcdedit to truncate the memory used by the OS. The FPGA card then had free use of the unused memory (often 64GB or more). The driver would map in a chunk at a time for processing; I think the limit was something like 2GB. One of the downsides of this approach was that Windows feature updates would wipe out the bcdedit setting. So I would only recommend this approach for a very controlled, specialized system.
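
The "map in a chunk at a time" part is just a walk over the reserved region in fixed-size windows. A rough sketch of that bookkeeping follows; the 64GB and 2GB figures are the approximate ones recalled above, the function name is hypothetical, and the actual mapping call a driver would make is deliberately left out.

```python
GIB = 1024**3


def chunk_windows(total_bytes, window_bytes):
    """Yield (offset, length) pairs covering total_bytes in window-sized steps."""
    offset = 0
    while offset < total_bytes:
        length = min(window_bytes, total_bytes - offset)
        yield offset, length
        offset += length


# ASSUMPTION: 64GB reserved region processed through a 2GB mapping window.
windows = list(chunk_windows(64 * GIB, 2 * GIB))
print(len(windows), "windows; last starts at byte offset", windows[-1][0])
```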

Copying 16 GB may seem like a lot, but it can be done easily via paper tape in 3 or 4 weeks.

Probably the question here isn't a one-off transfer, but a data rate.

Copying 16 GB may seem like a lot, but it can be done easily via paper tape in 3 or 4 weeks.

That made me laugh out loud. I can remember in my mainframe days, at one point we realized that, in terms of bandwidth, the fastest way to send a large file across campus was to have an intern carry a magnetic tape from building to building.

Hi All,

In my case, 16GB of data has to be read via DMA from RAM into an FPGA via PCIe. Upon completion, it starts again automatically until the user decides to stop.

Before the DMA starts, the data is already in RAM.

Thank you,
Zvika

Ah, so this is OUTBOUND DMA. As long as you can do scatter/gather, so that you don't need contiguous pages, that shouldn't be a big problem. That's, what, 10 or 20 seconds of transferring, depending on the PCIe generation and the number of lanes.
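
How much the generation and lane count matter is easy to estimate. The sketch below uses the standard per-lane signalling rates and line encodings (8b/10b for Gen 1 and 2, 128b/130b for Gen 3 and later). It gives a theoretical upper bound only: packet headers, flow control, and the DMA engine itself will all reduce what a real card achieves. The link configurations in the loop are arbitrary examples.

```python
# Per-lane raw rate in GT/s and line-encoding efficiency for each generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}

TOTAL_BYTES = 16 * 1024**3  # the 16GB buffer from the question


def link_bytes_per_second(generation, lanes):
    """Theoretical one-direction payload bandwidth, ignoring protocol overhead."""
    gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    return gt_per_s * 1e9 * efficiency / 8 * lanes


def transfer_seconds(total_bytes, generation, lanes):
    """Best-case time to move total_bytes across the link once."""
    return total_bytes / link_bytes_per_second(generation, lanes)


for generation, lanes in [(1, 4), (3, 1), (3, 8), (4, 4), (5, 4)]:
    rate_gb = link_bytes_per_second(generation, lanes) / 1e9
    seconds = transfer_seconds(TOTAL_BYTES, generation, lanes)
    print(f"Gen {generation} x{lanes}: {rate_gb:6.2f} GB/s, {seconds:6.2f} s")
```

The spread between the slowest and fastest rows is more than an order of magnitude, which is why the link configuration has to be known before quoting a transfer time.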

Tim - Thank you very much !
Best regards,
Zvika

Presumably you need to care whether it will be 10 or 20 or some other number of seconds.

Unless you are sending the same data over and over again (like a test pattern), or sending the data only once, there is some kind of rate at which new data will be generated, and so one of your chief interests is not the size of the data, but the rate at which it is generated and therefore needs to be transferred to your device.

I made a joke about it, but this is probably the most important question. It is likely that you won't have a problem with this, but quoting the performance requirement in terms of GB instead of GB/s is probably not very useful.

:thinking:

A PCIe Gen 5 x4 card will transfer on the order of 16GB per second.

What am I missing? (serious question, not sarcasm)

No, you're quite right. I don't know how common Gen 4 and Gen 5 are these days. I never worked with anything beyond Gen 3, and I'm losing touch with the state of the art in retirement. And I should have asked about the number of lanes first.

Scolding accepted.

Oh, please! No scolding intended!!

Gen 4 is common these days; Gen 5 is starting to be available (in working form) on commodity workstations. Scott (here at OSR) has a pretty nice AMD64 workstation with Gen 5 NVMe SSDs... of which I am jealous... that gets better than 13GB/sec sequential reads/writes.

Aside:
It is positively SCARY how quickly PCIe speeds increase... with each generation doubling (doubling!) the throughput of the previous generation. I really do not understand how the hardware designers maintain the integrity of their clock signals at these speeds (and, given the fact that there are lots of systems with "Gen 5" PCIe slots that will train and then not actually sustain transfers at Gen 5 speeds, I'd guess there are a lot of designers who DON'T maintain the integrity of their clock signals. I am using one such system as I type this).

It's the PCB designer's job to maintain signal integrity using controlled impedance for PCIe (the Telegrapher's Equations apply). Modern PCB tools can track differential pairs and, more importantly, the stack-up. PCIe is effectively a SERDES using PLLs to speed up data rates for a given clock.