Huge DMA size from FPGA to PC's RAM

Hello,

Every 50 msec, my FPGA generates a PCIe interrupt.

Upon each interrupt, the device driver has to build a descriptor list, write it to the FPGA, and start a DMA transfer from the FPGA to the PC's RAM.

The problem: the total size of the DMA is 170 MB.

Each descriptor can describe as little as 4 KB, so in the worst case I must build a list containing ~42,000 descriptors.

Is there a better mechanism?

As I was advised here (several times), allocating contiguous physical RAM is a bad idea.

Windows version: 7 (64-bit)

Best regards,
Z.V

Are you concerned about latency because the page list is very big?

A well-designed device DMAs the page descriptors in as it goes and does not need to keep the whole list on chip. A well-designed driver prepares the SGL in advance and programs it into the device, so that the device can DMA the data as it is being gathered.
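To make that concrete, here is a minimal C sketch of the ping-pong scheme: two descriptor lists, the device consuming one while the driver fills the other for the next 50 ms window. The register indices (DESC_BASE_LO and friends), the descriptor layout, and build_sg_list() are hypothetical placeholders for whatever the FPGA actually defines, not anything from the poster's hardware:

    #include <stdint.h>

    typedef struct {
        uint64_t phys_addr;   /* physical address of one data fragment */
        uint32_t length;      /* fragment length in bytes */
        uint32_t flags;       /* e.g. bit 0 = last descriptor */
    } DmaDesc;

    /* Hypothetical BAR register indices. */
    enum { DESC_BASE_LO, DESC_BASE_HI, DESC_COUNT, DMA_GO };

    #define MAX_DESC 43520u   /* 170 * 1024 * 1024 / 4096: worst-case page count */

    typedef struct {
        DmaDesc  list[2][MAX_DESC];
        uint64_t list_bus[2]; /* device-visible bus addresses of the two lists */
        uint32_t count[2];
        int      active;      /* which list the device is consuming */
    } DescRing;

    /* Stub: fill 'out' with the fragments for the next transfer and
     * return how many descriptors were written. */
    static uint32_t build_sg_list(DmaDesc *out) { (void)out; return 0; }

    /* Interrupt/DPC path: launch the list prepared during the previous
     * window, then immediately rebuild the idle list for the next one. */
    void on_frame_interrupt(DescRing *r, volatile uint32_t *regs)
    {
        int next = 1 - r->active;

        regs[DESC_BASE_LO] = (uint32_t)r->list_bus[next];
        regs[DESC_BASE_HI] = (uint32_t)(r->list_bus[next] >> 32);
        regs[DESC_COUNT]   = r->count[next];
        regs[DMA_GO]       = 1;
        r->active = next;

        /* The device is now busy for ~50 ms; use that time to prepare
         * the other list. */
        r->count[1 - r->active] = build_sg_list(r->list[1 - r->active]);
    }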

Zvi Vered wrote:

> Every 50 msec, my FPGA generates a PCIe interrupt.
>
> Upon each interrupt, the device driver has to build a descriptor list,
> write it to the FPGA, and start a DMA transfer from the FPGA to the
> PC's RAM.
>
> The problem: the total size of the DMA is 170 MB.

What generation and how many lanes? You're asking for 3.4 GB/s.

> Each descriptor can describe as little as 4 KB, so in the worst case I
> must build a list containing ~42,000 descriptors.
>
> Is there a better mechanism?
>
> As I was advised here (several times), allocating contiguous physical
> RAM is a bad idea.

Is it the same buffer every time? If so, just create the scatter/gather
list once. As long as you support scatter/gather, you don’t need it to
be physically contiguous.
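
A sketch of what that one-time setup could look like in a WDM driver, assuming a single long-lived buffer handed down from the app (error paths trimmed; MmProbeAndLockPages must be called in the context of the process that owns the buffer):

    #include <ntddk.h>

    static PMDL g_Mdl;  /* lives as long as the device does */

    /* Lock the app's buffer once at setup; after this the physical
     * pages are pinned, so the descriptor list can be built exactly
     * once and reused on every interrupt. */
    NTSTATUS LockAppBufferOnce(PVOID UserVa, SIZE_T Length)
    {
        g_Mdl = IoAllocateMdl(UserVa, (ULONG)Length, FALSE, FALSE, NULL);
        if (g_Mdl == NULL) {
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        __try {
            /* IoWriteAccess: the device writes into this memory. */
            MmProbeAndLockPages(g_Mdl, UserMode, IoWriteAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(g_Mdl);
            g_Mdl = NULL;
            return GetExceptionCode();
        }
        return STATUS_SUCCESS;
    }

    /* Undo at teardown. */
    VOID UnlockAppBuffer(VOID)
    {
        if (g_Mdl != NULL) {
            MmUnlockPages(g_Mdl);
            IoFreeMdl(g_Mdl);
            g_Mdl = NULL;
        }
    }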


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Your data rate is pretty high, but a 40 GbE NIC runs at higher data rates still, likely with millions of descriptors/sec. Of course, the DMA engine in a 40 GbE NIC had a LOT of careful design put into it to achieve good performance at that data and request rate. A 40 GbE NIC will also spread the processing load across a bunch of processors with things like RSS.

To echo what’s been said by others here, you should be preparing the next descriptor list while the device is processing the previous one.

Is it a hardware limitation that each descriptor can only cover 4K, or is that just what you expect each descriptor to be because typical memory comes in 4K pages? Your app could allocate memory in large pages (2M), and when those requests get translated by the DMA adapter, you will find you have a bunch of scatter/gather entries that can be coalesced into 2M fragments. Assuming your hardware supports 2 Mbyte fragments, that’s only ~85 descriptors per 170 Mbyte request. This only works if you are processing requests from your own application, which you know is using large pages. If you are processing requests from arbitrary applications, then 4K fragments may be what you have to deal with.

An app can allocate large pages using VirtualAlloc with the correct flags.
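
A minimal user-mode sketch of that, assuming the account already holds SeLockMemoryPrivilege and the privilege has been enabled in the process token (via AdjustTokenPrivileges, not shown):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* 0 means the processor/OS doesn't support large pages. */
        SIZE_T large = GetLargePageMinimum();   /* typically 2 MB on x64 */
        if (large == 0) {
            fprintf(stderr, "large pages not supported\n");
            return 1;
        }

        /* Large-page allocations must be a multiple of the large-page
         * size, so round 170 MB up. */
        SIZE_T want = 170ull * 1024 * 1024;
        SIZE_T size = (want + large - 1) & ~(large - 1);

        void *buf = VirtualAlloc(NULL, size,
                                 MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                                 PAGE_READWRITE);
        if (buf == NULL) {
            /* Typical failures: ERROR_PRIVILEGE_NOT_HELD, or physical
             * fragmentation after the system has been up for a while. */
            fprintf(stderr, "VirtualAlloc failed: %lu\n", GetLastError());
            return 1;
        }

        printf("%zu bytes of large-page (locked) memory at %p\n", size, buf);
        VirtualFree(buf, 0, MEM_RELEASE);
        return 0;
    }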

Jan


A side effect of large page allocation is that those pages are locked in memory.

Hi Jan, All,

You asked: “Is it a hardware limitation that each descriptor can only process 4K …”

It is not a hardware limitation, but I suspect the descriptor list the kernel builds will contain many 4K descriptors.

The application will allocate the buffer using VirtualAlloc. Is there a better Win32 allocation routine that would result in a smaller descriptor list?

Tim, the FPGA has 8 Gen3 lanes = 8 * 8 Gb/s * ~75% = 48 Gb/s.

Best regards,
Z.V

If you set the large-page flag on VirtualAlloc, you get memory-resident 2 Mbyte pages. To the driver everything looks the same, except that when you feed the MDLs to the DMA adapter, the physical-address scatter/gather list is still in 4K chunks, BUT you will find contiguous 4K chunks that make up the 2M pages. It’s easy to write code to coalesce these into 2M fragments for programming into the HW. If your hardware is OK with bigger fragments, and your app is OK with nonpageable, memory-resident buffers, it can dramatically reduce the number of descriptors the HW needs to process. TLB thrashing goes down too.

MDLs always describe 4K pages, but if an app uses 2M large pages, you get clusters of physically contiguous pages: 512 of them making up the 2M of contiguous space needed to map a large page. Drivers that expect everything to be 4K disjoint pages work just fine with large pages, and drivers that are large-page aware can coalesce the fragments when doing DMA. A driver could, I suppose, fail a request if it didn’t come from large pages, but making it still run, perhaps slower, is likely a better idea. Think of large pages as an optimization opportunity just before doing the DMA, and tell the application developer they might be happier with the performance if they set that flag in VirtualAlloc.
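
A sketch of that coalescing pass over a WDM SCATTER_GATHER_LIST; HwDesc is a made-up stand-in for the FPGA's descriptor format. With a large-page buffer the contiguous runs come out 2M long, so on the order of 85 descriptors cover the whole 170 Mbyte transfer:

    #include <ntddk.h>

    typedef struct {
        PHYSICAL_ADDRESS Address;  /* start of one hardware fragment */
        ULONG            Length;   /* fragment length in bytes */
    } HwDesc;

    /* Walk the S/G list and merge physically adjacent elements into
     * larger fragments. Returns the number of descriptors produced,
     * or 0 if Out was too small. */
    ULONG CoalesceSgl(const SCATTER_GATHER_LIST *Sgl, HwDesc *Out, ULONG MaxOut)
    {
        ULONG n = 0;
        for (ULONG i = 0; i < Sgl->NumberOfElements; i++) {
            const SCATTER_GATHER_ELEMENT *e = &Sgl->Elements[i];

            if (n > 0 &&
                Out[n - 1].Address.QuadPart + Out[n - 1].Length ==
                    e->Address.QuadPart) {
                /* Physically adjacent to the previous fragment: extend it. */
                Out[n - 1].Length += e->Length;
            } else {
                if (n == MaxOut) {
                    return 0;
                }
                Out[n].Address = e->Address;
                Out[n].Length  = e->Length;
                n++;
            }
        }
        return n;
    }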

When you talk about a custom FPGA and 3+ Gbyte/sec data rates, you’re likely talking about a less mainstream system.

Jan


No! While it’s obviously correct that an MDL describes a group of pages one page at a time, the S/G list is automatically coalesced, so each S/G element can describe more than 4K.

In fact, I’ve had to UN-coalesce those chunks just for that reason.
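
For completeness, a sketch of that un-coalescing pass, splitting each S/G element back into page-sized pieces for hardware whose descriptors can only carry 4K (HwDesc is again a hypothetical layout):

    #include <ntddk.h>

    typedef struct {
        PHYSICAL_ADDRESS Address;
        ULONG            Length;
    } HwDesc;

    /* Split every S/G element into PAGE_SIZE chunks. Returns the
     * number of descriptors produced, or 0 if Out was too small. */
    ULONG UncoalesceSgl(const SCATTER_GATHER_LIST *Sgl, HwDesc *Out, ULONG MaxOut)
    {
        ULONG n = 0;
        for (ULONG i = 0; i < Sgl->NumberOfElements; i++) {
            LONGLONG addr = Sgl->Elements[i].Address.QuadPart;
            ULONG    left = Sgl->Elements[i].Length;

            while (left != 0) {
                ULONG chunk = (left > PAGE_SIZE) ? PAGE_SIZE : left;
                if (n == MaxOut) {
                    return 0;
                }
                Out[n].Address.QuadPart = addr;
                Out[n].Length = chunk;
                n++;
                addr += chunk;
                left -= chunk;
            }
        }
        return n;
    }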

Peter
OSR
@OSRDrivers

xxxxx@gmail.com wrote:

> Tim, the FPGA has 8 Gen3 lanes = 8 * 8 Gb/s * ~75% = 48 Gb/s.

It is important, especially when discussing bus bandwidths where either
interpretation is reasonable, to use GB for gigabytes and Gb for gigabits.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Hi Tim, all,

On 14 Oct 2015, at 19:54, Tim Roberts wrote:
>
> xxxxx@gmail.com wrote:
>> Tim, the FPGA has 8 Gen3 lanes = 8 * 8 Gb/s * ~75% = 48 Gb/s.
>
> It is important, especially when discussing bus bandwidths where either
> interpretation is reasonable, to use GB for gigabytes and Gb for gigabits.
>
Even then it is unclear what the prefix “G” means.

I would recommend following SI notation:

https://en.m.wikipedia.org/wiki/Binary_prefix

Regards,

Leon

@Leon:

Transmission and signalling rates are usually expressed with decimal multipliers.

Thanks for the correction, Peter!

I usually try hard to write Gbytes or Gbits, which is clear even if you don’t know the GB/Gb convention. It still leaves the ambiguity of whether G means 1024^3 or 1000^3, but it does avoid the off-by-8x error.

Jan


Hi Jan, All,

Thank you for your detailed explanation.
I will test your solution soon and report back.

Best regards,
Z.V
