AVStream scatter/gather DMA overhead?

We have created a capture filter with a single video output pin, based on the avshws sample.
Our PCIe device is capable of bus-master scatter/gather DMA from on-board capture memory to system memory.
The frame size of captured video can be as large as 4 Mpixel @ 3 bytes/pixel -> 12 MByte/frame.
So, in the worst case, each frame needs 12 MB / 4 kB ≈ 3,000 S/G list entries.

According to our HW engineers, this might become a performance bottleneck, because the driver has to refresh the entire set of
S/G mappings in the FPGA for each individual frame, in the pin’s DispatchProcess() callback, using KSSTREAM_POINTER_OFFSET entries.
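
As a concrete illustration, here is a minimal sketch of that per-frame flow, assuming the pin is registered with
KSPIN_FLAG_GENERATE_MAPPINGS so that KSSTREAM_POINTER_OFFSET carries a KSMAPPING array. WriteSgEntry() is a
hypothetical helper that writes one descriptor into the FPGA; stream-pointer cloning and frame completion are omitted:

    /* Sketch only: per-frame S/G programming from the pin process dispatch. */
    NTSTATUS
    PinProcess(
        _In_ PKSPIN Pin
        )
    {
        PKSSTREAM_POINTER Leading =
            KsPinGetLeadingEdgeStreamPointer(Pin, KSSTREAM_POINTER_STATE_LOCKED);
        ULONG i;

        if (Leading == NULL) {
            return STATUS_SUCCESS;      /* no frame available yet */
        }

        /* One KSMAPPING per physically contiguous run (typically one page),
           so worst case roughly 3,000 writes for a 12 MB frame. */
        for (i = 0; i < Leading->OffsetOut.Count; i++) {
            WriteSgEntry(Leading->OffsetOut.Mappings[i].PhysicalAddress,
                         Leading->OffsetOut.Mappings[i].ByteCount);
        }

        return STATUS_SUCCESS;
    }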

This would be the case when the video consumer connected to our output pin (e.g. a renderer) allocates and provides the frame buffers;
in other words, when we set the MemoryFlags member of the KS_FRAMING_ITEM structure to KSALLOCATOR_REQUIREMENTF_MUST_ALLOCATE.
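
For context, a sketch of how such a requirement can be declared, modeled on the DECLARE_SIMPLE_FRAMING_EX macro the
avshws sample uses; the frame counts, sizes, and exact flag combination here are illustrative assumptions, not taken
from the post above:

    /* Sketch only: allocator framing for 12 MB system-memory frames. */
    DECLARE_SIMPLE_FRAMING_EX(
        AllocatorFraming,                             /* framing table name */
        STATICGUIDOF(KSMEMORY_TYPE_KERNEL_NONPAGED),  /* system memory */
        KSALLOCATOR_REQUIREMENTF_SYSTEM_MEMORY |
            KSALLOCATOR_REQUIREMENTF_MUST_ALLOCATE,
        2,                                            /* frames in flight */
        0,                                            /* no extra alignment */
        12 * 1024 * 1024,                             /* min frame size */
        12 * 1024 * 1024                              /* max frame size */
        );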

Ideally, we would like to fill the S/G list in hardware only once, before streaming of captured video starts, with enough
entries to span an exact number of frames (e.g. 3 frames), and then recycle this buffer as a ring buffer.
To do this, we probably have to allocate the target capture memory ourselves in the AVStream driver?
Or is a “common buffer” approach a better solution for avoiding the continuous S/G mapping-list update in the FPGA for each frame?
And how does that affect performance compared to the direct S/G approach?

The alternative approach we used in the past (not using AVStream):

  • allocate a large system memory buffer at driver startup
  • program the S/G list in the FPGA-based DMA bus master once
  • with each captured frame: DMA this frame into the system memory buffer
  • invoke a user-mode callback with a pointer to the captured frame.
    This has proven to be very efficient.

Thanks in advance for any advice on this.

  • Bernard Willaert
    Software Development Engineer
    Barco - Healthcare division
    Belgium

Loading the whole SGL into your device in advance is typically the wrong approach.

Normally, devices are designed so they keep only the current SGE and prefetch the next one (if the data rate is high enough to make that necessary). This may require a contiguous buffer for the SGL itself, so that you can give the device the physical address from which to fetch it.
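
For illustration, a hypothetical layout for such a fetchable SGL; the real descriptor format is whatever the hardware defines:

    /* Hypothetical descriptor format, for illustration only. The driver
       builds an array of these in a physically contiguous buffer and
       programs the device with the buffer’s bus address. */
    #pragma pack(push, 1)
    typedef struct _SG_DESCRIPTOR {
        ULONG64 PhysicalAddress;  /* destination chunk in system memory */
        ULONG   ByteCount;        /* chunk length (page-sized except at ends) */
        ULONG   Flags;            /* e.g. end-of-frame, interrupt-on-completion */
    } SG_DESCRIPTOR;
    #pragma pack(pop)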

It’s reasonable to require that the buffer be fragmented only at page boundaries (the typical case for all non-network buffers), so that the device doesn’t have to keep up with fetching too many SGEs.

The “alternate approach” is the worst possible design, because it requires that your contiguous buffer be mapped into user mode.

xxxxx@hotmail.com wrote:

The frame size of captured video can be as large as 4 Mpixel @ 3 bytes/pixel -> 12 MByte/frame.
So, in the worst case, each frame needs 12 MB / 4 kB ≈ 3,000 S/G list entries.

According to our HW engineers, this might become a performance bottleneck, because the driver has to refresh the entire set of S/G mappings in the FPGA for each individual frame, in the pin’s DispatchProcess() callback, using KSSTREAM_POINTER_OFFSET entries.

You’re talking about roughly 12 kB of list data per frame (≈3,000 entries
at 4 bytes each). That’s not a very large burden.
There is a bit of a timing issue, in that the hardware has to completely
finish uploading the scatter/gather list before it needs to start
filling that frame. As long as the hardware has room for 3 or 4 frames’
worth of page entries, that is not a big problem.

This would be the case when the video consumer connected to our output pin (e.g. a renderer) allocates and provides the frame buffers; in other words, when we set the MemoryFlags member of the KS_FRAMING_ITEM structure to KSALLOCATOR_REQUIREMENTF_MUST_ALLOCATE.

That is the vast majority of real-world cases. A preview renderer
always allocates an overlay or texture frame in video memory.

Ideally, we would like to fill the S/G list in hardware only once, before streaming of captured video starts, with enough entries to span an exact number of frames (e.g. 3 frames), and then recycle this buffer as a ring buffer.

I have done several AVStream drivers this way. The advantage is you
don’t have to manage the scatter/gather lists and the locking and
unlocking. The disadvantage is that you get a copy for each frame.
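
A minimal sketch of that per-frame copy, assuming a pin that does not generate DMA mappings (so OffsetOut.Data is a
system-memory pointer) and a driver-owned ring whose bookkeeping (RingBase as a PUCHAR, DoneIndex, FRAME_SIZE) exists
elsewhere:

    /* Sketch only: in the pin process dispatch, copy the newest completed
       ring slot into the downstream-allocated frame buffer. */
    PKSSTREAM_POINTER Leading =
        KsPinGetLeadingEdgeStreamPointer(Pin, KSSTREAM_POINTER_STATE_LOCKED);

    if (Leading != NULL) {
        RtlCopyMemory(Leading->OffsetOut.Data,            /* consumer’s buffer */
                      RingBase + DoneIndex * FRAME_SIZE,  /* completed slot */
                      FRAME_SIZE);

        /* Report the whole frame as filled and release the stream pointer. */
        KsStreamPointerAdvanceOffsetsAndUnlock(Leading, 0, FRAME_SIZE, TRUE);
    }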

Also remember that this can ONLY work if you get an interrupt at end of
frame, AND if you can tell absolutely which of the frame buffer(s)
contain new frames. If the hardware designers think you can just stay
in sync by counting, that’s a recipe for disaster. There’s always a
chance that you’ll get two frame completions before you are able to get
into your ISR. You need to be able to handle that.
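
A sketch of that defensive pattern: drain every completed frame in the DPC instead of assuming one completion per
interrupt. ReadCompletedFrameIndex() and CompleteFrame() are hypothetical, hardware-specific routines, and
DEVICE_CONTEXT is an assumed driver structure:

    VOID
    FrameDoneDpc(
        _In_ PKDPC Dpc,
        _In_opt_ PVOID Context,
        _In_opt_ PVOID Arg1,
        _In_opt_ PVOID Arg2
        )
    {
        PDEVICE_CONTEXT DevCtx = (PDEVICE_CONTEXT)Context;
        ULONG HwIndex;

        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);

        /* Ask the hardware how far it has written, then complete every frame
           between the software index and that point; this stays correct even
           when two frames finished before the ISR could run. */
        HwIndex = ReadCompletedFrameIndex(DevCtx);

        while (DevCtx->SwIndex != HwIndex) {
            CompleteFrame(DevCtx, DevCtx->SwIndex);
            DevCtx->SwIndex = (DevCtx->SwIndex + 1) % DevCtx->FrameCount;
        }
    }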

To do this, we probably have to allocate the target capture memory ourselves in the AVStream driver? Or is a “common buffer” approach a better solution for avoiding the continuous S/G mapping-list update in the FPGA for each frame? And how does that affect performance compared to the direct S/G approach?

In general, I have found that the common buffer approach is the easiest
way to allocate a buffer when I need both the virtual and physical
addresses. That has the added benefit of being physically contiguous.
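
A sketch of that allocation, assuming a DMA adapter previously obtained with IoGetDmaAdapter() and an illustrative
three-frame ring:

    /* Sketch only: one contiguous common buffer holding a 3-frame ring.
       VirtualAddress is for the driver; LogicalAddress is what the FPGA
       is programmed with. */
    PHYSICAL_ADDRESS LogicalAddress;
    ULONG BufferSize = 3 * FRAME_SIZE;    /* FRAME_SIZE assumed elsewhere */

    PVOID VirtualAddress =
        DmaAdapter->DmaOperations->AllocateCommonBuffer(
            DmaAdapter,
            BufferSize,
            &LogicalAddress,
            TRUE);                        /* CacheEnabled */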

I’m not sure what you’re asking about performance. In terms of the
operating system, there’s no difference at all whether the buffer is
contiguous or not. The cost to the FPGA is something your hardware
designers will have to estimate. Can their DMA engine handle
simultaneous transfers in both directions, or do you have to wait until
the frame download is complete before you can upload another S/G list?

The alternative approach we used in the past (not using AVStream):

  • allocate a large system memory buffer at driver startup
  • program the S/G list in the FPGA-based DMA bus master once
  • with each captured frame: DMA this frame into the system memory buffer
  • invoke a user-mode callback with a pointer to the captured frame.
    This has proven to be very efficient.

That’s a gamble. You’re gambling that your user-mode application can
handle every last pixel of the frame before the hardware wants to start
overwriting it. That’s a very difficult thing to defend against.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> xxxxx@hotmail.com wrote:

> The frame size of captured video can be as large as 4 Mpixel @ 3
> bytes/pixel -> 12 MByte/frame.
> So, in the worst case, each frame needs 12 MB / 4 kB ≈ 3,000 S/G list
> entries.
>
> According to our HW engineers, this might become a performance bottleneck,
> because the driver has to refresh the entire set of S/G mappings in the
> FPGA for each individual frame, in the pin’s DispatchProcess() callback,
> using KSSTREAM_POINTER_OFFSET entries.

I find it odd that they would think they have to load the entire S/G
list before starting. The usual technique is characterized as “when you
run out of the current address/count pair, go out and fetch the next one
using DMA; continue until the list is exhausted”. As Tim indicates above,
it may make sense to buffer a small number of S/G entries to reduce the
DMA interleaving and improve response, but the notion that the entire S/G
list has to be loaded FIRST is more than bizarre. The whole point of an
S/G list is that it can be an unlimited “on-demand” way to provide
discontiguous buffering without needing to do copies.
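
For illustration, a hypothetical chained variant of the descriptor sketched earlier supports exactly this on-demand
fetching; each entry carries the bus address of its successor:

    /* Hypothetical chained descriptor: the device fetches the next entry by
       DMA when the current one is exhausted, so the list can be any length.
       The real layout is hardware-defined. */
    #pragma pack(push, 1)
    typedef struct _SG_CHAIN_DESCRIPTOR {
        ULONG64 PhysicalAddress;   /* destination chunk in system memory */
        ULONG   ByteCount;         /* chunk length */
        ULONG   Flags;             /* end-of-frame, interrupt bits, ... */
        ULONG64 NextDescriptor;    /* bus address of next entry; 0 = end */
    } SG_CHAIN_DESCRIPTOR;
    #pragma pack(pop)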
joe


Our sincere thanks for all your valuable input.

We will start by implementing and testing the “normal” flow, where the connected filter allocates the capture buffers and supplies an S/G list to our capture pin.
The S/G list will be fed to the FPGA DMA engine on the fly, as it is designed to work.
We will then analyze the PCIe traffic to verify that there are no issues.

The main concern originated from the fact that transferring the S/G list into the FPGA was based on DWORD writes, one at a time, which is not very efficient over PCIe. We will now prepare this list in driver-allocated contiguous system memory and indicate to the hardware where it can fetch this list.
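
A sketch of that revised flow, reusing the hypothetical SG_DESCRIPTOR layout from earlier; SgListVa/SgListPa are assumed
to come from AllocateCommonBuffer(), and SG_FLAG_EOF plus the register names are illustrative assumptions:

    /* Sketch only: build the list in contiguous system memory, then hand the
       FPGA its base address with a few register writes instead of one DWORD
       write per entry. */
    ULONG i;

    for (i = 0; i < MappingCount; i++) {
        SgListVa[i].PhysicalAddress = Mappings[i].PhysicalAddress.QuadPart;
        SgListVa[i].ByteCount       = Mappings[i].ByteCount;
        SgListVa[i].Flags           = (i == MappingCount - 1) ? SG_FLAG_EOF : 0;
    }

    WRITE_REGISTER_ULONG(&Regs->SgListBaseLo, SgListPa.LowPart);
    WRITE_REGISTER_ULONG(&Regs->SgListBaseHi, (ULONG)SgListPa.HighPart);
    WRITE_REGISTER_ULONG(&Regs->SgListCount,  MappingCount);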

Thanks again,

  • Bernard Willaert