[AVStream] Kernel mode custom allocator?

We have a kernel mode network stream filter that has one input pin that will accept RGB24 frames, just like a renderer.
This filter has a custom allocator on board that supports all AllocatorDispatch callbacks.
The idea is to give a connecting source filter direct access to our internal onboard buffers.
Ideally, a GPU, for example, could DMA or memcpy directly into these buffers via this custom allocator.
As a test, I modified the PushSource sample slightly to generate an RGB24 grayscale ramp (the value increases per frame), but it did not activate the custom allocator, just as the documentation states here:
http://msdn.microsoft.com/en-us/library/windows/hardware/ff560976(v=vs.85).aspx
--> Note that memory allocated at kernel level cannot be passed to a user-mode filter.
When I connect the input pin to the output pin of our kernel mode capture filter, the custom allocator IS activated and used.

Is there any alternative way to offer direct access to our onboard buffers from a user-mode filter?

  • Bernard Willaert
    Barco - Healthcare Division
    Belgium

xxxxx@barco.com wrote:

> We have a kernel mode network stream filter that has one input pin that will accept RGB24 frames, just like a renderer.
> This filter has a custom allocator on board that supports all AllocatorDispatch callbacks.
> The idea is to give a connecting source filter direct access to our internal onboard buffers.
> Ideally, a GPU, for example, could DMA or memcpy directly into these buffers via this custom allocator.
> As a test, I modified the PushSource sample slightly to generate an RGB24 grayscale ramp (the value increases per frame), but it did not activate the custom allocator, just as the documentation states here:
> http://msdn.microsoft.com/en-us/library/windows/hardware/ff560976(v=vs.85).aspx
> --> Note that memory allocated at kernel level cannot be passed to a user-mode filter.
> When I connect the input pin to the output pin of our kernel mode capture filter, the custom allocator IS activated and used.

There are some counterintuitive tradeoffs to be made here. A purely
kernel mode transform filter only makes sense if both sides of the
filter will also be kernel filters. If either side is going to be
user-mode, then you will be paying the overhead of a user/kernel
transition anyway.

So, you might consider, for example, not building an AVStream filter,
but instead creating a user-mode DirectShow filter that uses a generic
interface to call into your driver to do the real work. That way, your
DirectShow filter can expose your custom allocator on both the input and
output pins. You can decide how to map the buffers into user mode.
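To make that concrete, here is a rough sketch of what such an allocator could look like in user mode, built on the DirectShow base classes. It follows the CMemAllocator pattern, but hands out buffers the driver has already mapped into user mode; MapFpgaBuffer and the mapping scheme are invented placeholders, not a real API:

```cpp
// Sketch only: a user-mode DirectShow allocator that wraps driver-mapped
// buffers instead of allocating its own. MapFpgaBuffer is hypothetical;
// everything else follows the CMemAllocator pattern in the base classes.
#include <streams.h>

class CFpgaAllocator : public CBaseAllocator
{
public:
    CFpgaAllocator(HRESULT *phr)
        : CBaseAllocator(NAME("FPGA allocator"), NULL, phr) {}

protected:
    // Called on Commit: wrap each pre-mapped onboard buffer in a
    // CMediaSample instead of calling new BYTE[].
    HRESULT Alloc(void)
    {
        CAutoLock lock(this);
        HRESULT hr = CBaseAllocator::Alloc();   // validates SetProperties
        if (FAILED(hr)) return hr;
        if (hr == S_FALSE) return NOERROR;      // requirements unchanged

        for (; m_lAllocated < m_lCount; m_lAllocated++) {
            BYTE *pMapped = MapFpgaBuffer(m_lAllocated, m_lSize); // hypothetical
            if (pMapped == NULL) return E_OUTOFMEMORY;

            CMediaSample *pSample = new CMediaSample(
                NAME("FPGA sample"), this, &hr, pMapped, m_lSize);
            if (pSample == NULL) return E_OUTOFMEMORY;
            m_lFree.Add(pSample);               // hand it to the free list
        }
        m_bChanged = FALSE;
        return NOERROR;
    }

    // Called on Decommit once all samples are back: release the wrappers
    // (real code would also unmap the driver buffers here).
    void Free(void)
    {
        CMediaSample *pSample;
        while ((pSample = m_lFree.RemoveHead()) != NULL)
            delete pSample;
        m_lAllocated = 0;
    }

    BYTE *MapFpgaBuffer(long index, long size);  // maps a KM buffer into UM
};
```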

> Is there any alternative way to offer direct access to our onboard buffers from a user-mode filter?

There is a fair amount of slime involved in getting DirectDraw overlays
exposed from VMR and Direct3D textures exposed from EVR, and
unfortunately the source is not included. The base classes do have
DirectDraw code in videoctl.cpp, but that’s a user-mode interface.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thanks, Tim!

> There are some counterintuitive tradeoffs to be made here. A purely kernel mode transform filter only makes sense if both sides of the filter will also be kernel filters.

In our case, the filter has only one input pin, which accepts RGB24 video. The output is a network interface (video stream).
This filter drives the internal hardware of our FPGA to manage the network stream, so this is (obviously?) a kernel-mode filter.

We are thinking of implementing two different ways of exposing the video input path:

1/ From a user-mode filter that generates RGB24: use the generic allocator scheme of the graph and just memcpy from the kernel-mode filter's input pin into the internal FPGA memory. This will ruin performance but is fine as a proof of concept and for testing: we can connect one of the PushSource sample filters from the SDK directly to stream, e.g., a bitmap or the desktop. In the real application, frames can be up to 8 Mp.

2/ Provide a custom interface (plus a property page) on the filter that exposes the underlying FPGA memory buffer directly, both as a physical address / scatter-gather list for DMA and as a virtual address for simple memcpy access. This information can be retrieved in user mode and passed to another driver that controls, e.g., the GPU DMA. The GPU that composes an image can then DMA images directly into the FPGA memory buffer. In addition, we will of course need a proprietary handshake mechanism to signal the arrival of a new frame, which triggers the network transmission. This will also be exposed through the custom interface. A rough sketch of such an interface follows below.
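For illustration only, the custom interface of 2/ could look roughly like this; the interface name, the IID, the struct layout, and the method names are all placeholders we would still have to pin down:

```cpp
// Hypothetical custom COM interface for approach 2/. All names, the
// IID, and the struct layout are invented for illustration only.
#include <windows.h>
#include <unknwn.h>

typedef struct _FPGA_BUFFER_DESC
{
    PVOID     VirtualAddress;    // user-mode mapping, for plain memcpy access
    ULONGLONG PhysicalAddress;   // base physical address (contiguous case)
    ULONG     Length;            // buffer length in bytes
    ULONG     SgEntryCount;      // scatter/gather entries available for DMA
} FPGA_BUFFER_DESC;

// {xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx} -- placeholder IID
DECLARE_INTERFACE_(IFpgaBufferAccess, IUnknown)
{
    // Describe one onboard frame buffer (virtual and physical view).
    STDMETHOD(GetBufferDesc)(THIS_ ULONG Index, FPGA_BUFFER_DESC *pDesc) PURE;

    // The proprietary handshake: signal that a complete frame has been
    // written, which triggers the network transmission.
    STDMETHOD(SignalFrameReady)(THIS_ ULONG Index) PURE;
};
```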

Please let us know your opinion of this implementation.

Thanks again,

  • Bernard Willaert

xxxxx@barco.com wrote:

> Thanks, Tim!
>> There are some counterintuitive tradeoffs to be made here. A purely kernel mode transform filter only makes sense if both sides of the filter will also be kernel filters.
> In our case, the filter has only one input pin, which accepts RGB24 video. The output is a network interface (video stream).
> This filter drives the internal hardware of our FPGA to manage the network stream, so this is (obviously?) a kernel-mode filter.

No, it’s not obvious at all. There needs to be a kernel component, but
that component does not necessarily have to be an AVStream filter.

In the real world, people do not build graphs that are 100% Kernel
Streaming components. (It is possible to do so, and there are
frameworks that do this in the audio world, but not for video.) To
three decimal places, 100% of the video capture applications in the
world are DirectShow graphs or Media Foundation topologies.

In both cases, that means that the entire graph actually consists of
user-mode components. When you have an AVStream capture device, the
user mode component will be a wrapper like ksproxy, but the actual graph
participant is a user mode filter.

If you happen to stumble upon a case where two consecutive filters are
both AVStream, the graph will connect them at the kernel level, and the
data will not pass into user mode. However, that almost never happens.
When you have a kernel node followed by a user-mode node, the data has
to percolate back up to user mode.

So, think about where your RGB24 video is coming from. Unless you are
always coming from a video capture device that natively generates
uncompressed RGB24, your upstream filter will always be in user mode.
That means the buffer communication is going to happen in user mode, and
you are at the mercy of ksproxy’s rules for buffer allocation, which do
not seem to meet your needs.

The very practical alternative is to write the wrapper yourself.
Instead of relying on ksproxy to load and talk to your driver, just
write a DirectShow in-place transform filter that loads your driver
(after removing the AVStream stuff). Now you can implement your
allocator using the DirectShow mechanisms. You have full control.
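A minimal sketch of that plumbing, with an invented device name and IOCTL code (the real ones would come from your driver): the filter opens the device once and hands each sample's buffer straight to the driver, which can build a scatter/gather list from the MDL and DMA out of it.

```cpp
// Sketch of the user-mode side of a hand-rolled wrapper; the symbolic
// link name and the IOCTL definition are placeholders, not a real API.
#include <windows.h>
#include <winioctl.h>

// METHOD_IN_DIRECT: the "output" buffer is locked down and described by
// an MDL probed for read access, so the driver can DMA from the frame
// without an intermediate copy.
#define IOCTL_FPGA_SUBMIT_FRAME \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_IN_DIRECT, FILE_ANY_ACCESS)

class CDriverLink
{
public:
    bool Open()
    {
        // Placeholder symbolic link name for the driver's device object.
        m_hDevice = CreateFileW(L"\\\\.\\FpgaTx",
                                GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        return m_hDevice != INVALID_HANDLE_VALUE;
    }

    // Submit one RGB24 frame; blocks until the driver completes the IRP.
    bool SubmitFrame(BYTE *pFrame, DWORD cbFrame)
    {
        DWORD cbReturned = 0;
        return DeviceIoControl(m_hDevice, IOCTL_FPGA_SUBMIT_FRAME,
                               NULL, 0,          // no "input" payload
                               pFrame, cbFrame,  // frame as direct-I/O buffer
                               &cbReturned, NULL) != FALSE;
    }

    ~CDriverLink()
    {
        if (m_hDevice != INVALID_HANDLE_VALUE)
            CloseHandle(m_hDevice);
    }

private:
    HANDLE m_hDevice = INVALID_HANDLE_VALUE;
};
```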

> 2/ Provide a custom interface (plus a property page) on the filter that exposes the underlying FPGA memory buffer directly, both as a physical address / scatter-gather list for DMA and as a virtual address for simple memcpy access. This information can be retrieved in user mode and passed to another driver that controls, e.g., the GPU DMA. The GPU that composes an image can then DMA images directly into the FPGA memory buffer.

How does a GPU get involved in this? DXVA?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thank you, Tim, for this detailed answer! Very clarifying once again.

>> How does a GPU get involved in this? DXVA?

There is an NVidia GPU board sitting next to our PCIe board. The GPU composes an image and dumps the frame into our onboard memory, from which it is streamed onto the network. This is why we need DMA from the GPU to our internal buffers - for performance reasons. The data is always RGB24 uncompressed - up to 8Mp per frame.

>> So, think about where your RGB24 video is coming from.

The input pin on our filter (that streams data onto the network) would in fact not be used if we perform DMA access. It would only be used if we connect it in a graph to an RGB24 test pattern generator filter. In this case we can do an internal copy to the hardware buffers. This will work for smaller frames at reduced frame rate and is fine for testing the network streaming functionality.
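For that test path, the kernel side of the internal copy could look roughly like this sketch of a pin-centric AVStream process callback; g_FpgaWindow stands in for a mapping of our onboard buffer (e.g. via MmMapIoSpace) and is purely illustrative:

```cpp
// Sketch of a pin-centric AVStream process callback that copies each
// arriving RGB24 frame into the FPGA's onboard memory. g_FpgaWindow and
// g_FpgaWindowBytes are hypothetical; real code also needs state checks.
#include <ntddk.h>
#include <windef.h>
#include <ks.h>

extern PUCHAR g_FpgaWindow;       // mapped onboard frame buffer (placeholder)
extern ULONG  g_FpgaWindowBytes;  // size of that mapping (placeholder)

NTSTATUS FpgaTxPinProcess(PKSPIN Pin)
{
    PKSSTREAM_POINTER Leading =
        KsPinGetLeadingEdgeStreamPointer(Pin, KSSTREAM_POINTER_STATE_LOCKED);

    while (Leading != NULL) {
        ULONG cb = min(Leading->StreamHeader->DataUsed, g_FpgaWindowBytes);

        // Plain memcpy from the system buffer into device memory; this
        // is the performance-killing step that the DMA path avoids.
        RtlCopyMemory(g_FpgaWindow, Leading->StreamHeader->Data, cb);

        // Here the hardware would be told to transmit the frame.

        // Eject the frame and move on to the next one, if any.
        KsStreamPointerUnlock(Leading, TRUE);
        Leading = KsPinGetLeadingEdgeStreamPointer(
                      Pin, KSSTREAM_POINTER_STATE_LOCKED);
    }
    return STATUS_SUCCESS;
}
```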

  • Bernard Willaert

xxxxx@barco.com wrote:

>>> How does a GPU get involved in this? DXVA?
> There is an NVidia GPU board sitting next to our PCIe board. The GPU composes an image and dumps the frame into our onboard memory, from which it is streamed onto the network. This is why we need DMA from the GPU to our internal buffers - for performance reasons. The data is always RGB24 uncompressed - up to 8Mp per frame.

Is the GPU processing being handled by another AVStream driver? Or is
it called directly by your driver?

If your typical use case has the GPU shipping into your hardware’s
onboard memory, with no other components, then I’m not sure why you want
DirectShow involved in this at all. What do you gain? You’ll have to
write a custom capture application to make this all work anyway. Why
not just connect these drivers directly, using a custom interface?

> The input pin on our filter (that streams data onto the network) would in fact not be used if we perform DMA access. It would only be used if we connect it in a graph to an RGB24 test pattern generator filter. In this case we can do an internal copy to the hardware buffers. This will work for smaller frames at reduced frame rate and is fine for testing the network streaming functionality.

So, as I see it, you have two use cases. Time for some ASCII artwork:

        DShow graph
          ksproxy
             |             UM
-------------+---------------
             |             KM
  GPU ______ your
  driver     driver

or

  Generator --> ksproxy
                   |       UM
-------------------+---------
                   |       KM
                 your
                 driver

I think I would be tempted to forget about making your driver AVStream.
Just make it a custom interface. In the direct-to-GPU case, set it up
in your production application. In the test-generator case, create a
simple DirectShow renderer filter that talks to the custom interface for
your driver.
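A bare-bones sketch of such a renderer filter, reusing the hypothetical CDriverLink wrapper from the earlier sketch (the CLSID is a placeholder):

```cpp
// Sketch of a minimal DirectShow renderer-style filter for the test
// path: it accepts RGB24 and pushes each frame to the driver through
// the custom interface. CDriverLink is the hypothetical wrapper
// sketched earlier; the CLSID below is a placeholder.
#include <streams.h>

// Placeholder CLSID, not registered anywhere.
static const GUID CLSID_FpgaNetRenderer =
{ 0x12345678, 0x1234, 0x1234,
  { 0x12, 0x34, 0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc } };

class CFpgaNetRenderer : public CBaseRenderer
{
public:
    CFpgaNetRenderer(HRESULT *phr)
        : CBaseRenderer(CLSID_FpgaNetRenderer,
                        NAME("FPGA net renderer"), NULL, phr)
    {
        if (phr && SUCCEEDED(*phr) && !m_link.Open())
            *phr = E_FAIL;   // driver not present
    }

    // Accept only uncompressed RGB24 video.
    HRESULT CheckMediaType(const CMediaType *pmt)
    {
        if (*pmt->Type() != MEDIATYPE_Video ||
            *pmt->Subtype() != MEDIASUBTYPE_RGB24)
            return VFW_E_TYPE_NOT_ACCEPTED;
        return S_OK;
    }

    // Hand each arriving frame down to the hardware.
    HRESULT DoRenderSample(IMediaSample *pSample)
    {
        BYTE *pData = NULL;
        HRESULT hr = pSample->GetPointer(&pData);
        if (FAILED(hr)) return hr;
        return m_link.SubmitFrame(pData,
                   (DWORD)pSample->GetActualDataLength()) ? S_OK : E_FAIL;
    }

private:
    CDriverLink m_link;   // hypothetical wrapper from the earlier sketch
};
```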


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thanks again, Tim, for your valuable input.
You are absolutely right about the dilemma: AVStream or not for the transmit part of our board.
Your ASCII artwork is exactly what we were thinking.
The network capture side works fine as an AVStream driver, just like any other capture card.
But for the transmit side - putting a stream on the network - there is very little gain in presenting an AVStream filter, except for the demo/test use case where we just copy incoming frames from the input pin to the hardware memory.

We will reflect on this to see whether there is really any gain at all in presenting the TX network part as a DirectShow filter, and maybe contact some potential customers to hear their opinion.

Thank you for your time and advice.

  • Bernard Willaert