Yet another newbie kernel/user memory mapping question

Hello ntdev list:

I am a first time windows driver dev, seeking help from experienced driver writers such as yourselves.
I am also newly joined to ntdev and this is my first question after several months perusing the list archives.
My question is about bandwidth performance of memory-mapped I/O regions between device and host, and whether
I am setting these up correctly for my application. I understand that “Mapping kernel memory to user space
is a complex and error-prone technique that in general should be ignored as a possible solution” but that was
the design decision that flowed down to me. I hope this is not too much tl;dr …

Driver project background:

  • A custom PCIe analog video capture board based on TI DM8168.
  • Using KMDF/vs2012 to develop the windows driver. A linux driver is available as a reference design.
  • Ring buffers containing video data are mapped to shared user memory on the host; not using DMA.
  • The board PCIe configuration supports the concept of mapping host physical addresses as inbound and outbound.
  • Inbound memory resides on the board, Outbound resides on the host.
  • Mapped regions must be contiguous and 1M aligned.
  • There are 32 available outbound mapping slots, size 1|2|4|8 MB each. We are setting them all to 1M and combining them in userland.

Using the msdn WDF online documentation and driver example code, I was able to put together a functional windows
driver that can load firmware (linux uboot+kernel+fs) to the device and boot it. The resource list that is presented
by PnP manager consists of 4 BARs:
Bar0 : 4k, pci config space
Bar1 : 8M, memory
Bar2 : 16M, memory
Bar3 : 256M, memory
I am able to use MmMapIoSpace on bars 0-2, but mapping bar3 usually fails due to insufficient resources so I don’t map it.
I would like to use a portion of bar3 as host memory (outbound for the device) but I don’t know how to do that
(my understanding is that MmMapIoSpace does not produce system virtual addresses that represent physical host addresses).
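
For reference, the BAR mapping happens in my EvtDevicePrepareHardware callback, roughly like this (a simplified sketch rather than my exact code; DEVICE_CONTEXT, GetDeviceContext, and the BarBase/BarLength fields are placeholder names):

NTSTATUS EvtDevicePrepareHardware(
    WDFDEVICE Device,
    WDFCMRESLIST ResourcesRaw,
    WDFCMRESLIST ResourcesTranslated)
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(Device);  // placeholder accessor
    ULONG barIndex = 0;

    UNREFERENCED_PARAMETER(ResourcesRaw);

    for (ULONG i = 0; i < WdfCmResourceListGetCount(ResourcesTranslated); i++) {
        PCM_PARTIAL_RESOURCE_DESCRIPTOR desc =
            WdfCmResourceListGetDescriptor(ResourcesTranslated, i);

        if (desc->Type != CmResourceTypeMemory) {
            continue;
        }

        // Map BAR0..BAR2; BAR3 (256M) routinely fails with
        // STATUS_INSUFFICIENT_RESOURCES, so it is left unmapped.
        if (barIndex < 3) {
            ctx->BarBase[barIndex] = MmMapIoSpace(desc->u.Memory.Start,
                                                  desc->u.Memory.Length,
                                                  MmNonCached);
            ctx->BarLength[barIndex] = desc->u.Memory.Length;
            if (ctx->BarBase[barIndex] == NULL) {
                return STATUS_INSUFFICIENT_RESOURCES;
            }
        }
        barIndex++;
    }
    return STATUS_SUCCESS;
}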

Getting the outbound video streams is another matter. In the linux reference driver, this is done by allocating numerous
contiguous 1M physical chunks (1M aligned) and stitching them together into a single virtual address. This is because
obtaining a larger contiguous/aligned section of physical memory is hard to do if the driver gets reloaded too many times,
due to fragmentation. Also, each outbound mapping register on the device requires a host physical address.
In my windows driver I tried something similar, like this (error handling omitted for clarity):

const SIZE_T pagesPerHostBuffer = BYTES_TO_PAGES(1*1024*1024);
const SIZE_T pfnSize = sizeof(PFN_NUMBER) * pagesPerHostBuffer;

// Allocate the large MDL which will be mapped to user space
PMDL pHostMdl = IoAllocateMdl(
    NULL,         // This MDL will not have any associated system virtual address.
    hostMdlSize,  // Comprises the entire group of host memory regions.
    FALSE,        // Not associated with an Irp, not part of an MDL chain.
    FALSE,        // ChargeQuota is always FALSE.
    NULL          // No associated Irp.
    );

// Allocate and map the smaller, aligned MDL buffers
for (ULONG index = 0; index < host_num_buffers; index++) {
    PMDL pMdl = MmAllocatePagesForMdlEx(
        LowAddress,    // 0
        HighAddress,   // ~0
        SkipBytes,     // 1*1024*1024
        1*1024*1024,   // TotalBytes
        MmNonCached,
        MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS
        );

    PVOID pVirt = MmMapLockedPagesSpecifyCache(
        pMdl,
        KernelMode,
        MmNonCached,
        NULL,
        FALSE,
        HighPagePriority
        );

    // the source of the physical page data within the small MDL …
    PPFN_NUMBER pSrc = MmGetMdlPfnArray(pMdl);

    // … will be copied in order to the single large MDL.
    PPFN_NUMBER pDst = &MmGetMdlPfnArray(pHostMdl)[index * pagesPerHostBuffer];

    RtlCopyMemory(pDst, pSrc, pfnSize);

    pHostMdl->MdlFlags = pMdl->MdlFlags;
}

// Finally, map the large FrankenMDL to user space
PVOID user_addr = MmMapLockedPagesSpecifyCache(
    pHostMdl,
    UserMode,
    MmNonCached,
    NULL,             // Allow the system to choose the user-mode starting address.
    FALSE,            // Disable BugCheckOnFailure since the mapping is to user mode.
    HighPagePriority  // Indicates the importance of success when PTEs are scarce.
    );

This method (kind of pseudo scatter-gather) seems to work in terms of a) not crashing and b) getting
board data to the host, but it really feels like a kludge solution and I have just been lucky so far.
In addition, the performance is miserable compared to the linux driver (about 1/10th rate of transfer
using simple non-optimized memcpy test on the same underlying hardware). For reference, the linux
version also declares all its mappings as noncached. I realize this all begs for a redesign using
DMA but it’s very unlikely to turn out that way.

I would appreciate any suggestions of how to structure this better (without having to involve DMA)
or any hints how to get better performance.

Thanks in advance.

Why do you allocate non-cached host memory? There is no reason to use non-cached host memory.

You CANNOT JUST COPY the PFN array. Create an MDL for the non-paged address (IoAllocateMdl/MmBuildMdlForNonPagedPool) and have it mapped to user mode.
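
In other words, something along these lines (a sketch only; 'buffer' and 'length' stand in for one of your nonpaged, contiguous chunks):

PMDL mdl = IoAllocateMdl(buffer, length, FALSE, FALSE, NULL);
if (mdl != NULL) {
    MmBuildMdlForNonPagedPool(mdl);   // fills in the PFN array from the VA

    __try {
        PVOID userVa = MmMapLockedPagesSpecifyCache(
            mdl, UserMode, MmCached, NULL, FALSE, NormalPagePriority);
        // hand userVa back to the application
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        // a user-mode mapping can raise an exception; fail the request
    }
}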

And the device design, if it doesn’t support DMA and requires giant BARs, is totally retarded.

If it’s not using DMA, then it doesn’t really need contiguous physical memory. If it really needs it, then it’s actually using DMA.

If you want to be successful without having DMA, just have the application send IOCTL requests with a METHOD_OUT_DIRECT buffer type. The driver, when necessary, will do memcpy from the device BAR to the application buffer (using the system mapped virtual address, of course).

This way, you don’t need to mess with shared buffers. The IOCTL solution will be as fast as shared buffers, and much more reliable.
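
For example, a minimal sketch of such a handler, assuming KMDF (IOCTL_READ_FRAME, DEVICE_CONTEXT, GetDeviceContext, Bar2Base and Bar2Length are placeholder names, not a real interface):

VOID EvtIoDeviceControl(
    WDFQUEUE Queue, WDFREQUEST Request,
    size_t OutputBufferLength, size_t InputBufferLength, ULONG IoControlCode)
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(WdfIoQueueGetDevice(Queue));
    NTSTATUS status = STATUS_INVALID_DEVICE_REQUEST;
    size_t bytes = 0;

    UNREFERENCED_PARAMETER(OutputBufferLength);
    UNREFERENCED_PARAMETER(InputBufferLength);

    if (IoControlCode == IOCTL_READ_FRAME) {
        PVOID outBuf;
        // With METHOD_OUT_DIRECT the framework has already probed and locked
        // the caller's buffer; this returns its system-space mapping.
        status = WdfRequestRetrieveOutputBuffer(Request, 1, &outBuf, &bytes);
        if (NT_SUCCESS(status)) {
            size_t copyLen = (bytes < ctx->Bar2Length) ? bytes : ctx->Bar2Length;
            // Bar2Base is the system VA obtained earlier with MmMapIoSpace.
            RtlCopyMemory(outBuf, ctx->Bar2Base, copyLen);
            bytes = copyLen;
        }
    }
    WdfRequestCompleteWithInformation(Request, status, bytes);
}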

This device was clearly not designed for Windows. Are you trying to support
standard video capture in Windows, or only your own custom application? I
have never looked at this in detail, but the name DirectShow comes to mind
for the standard model; others who know better will likely correct my
ignorance. Before deciding on a design, I suggest you at least research the
in-box capabilities of the OS, as you may be able to get a lot of
functionality for free. If this device is for general distribution, this may
score points with your users too, as they will expect their other
applications to be able to use your device as well.

As an aside, video devices are notorious for poor design choices by hardware
engineers; this one looks like another example. Also, the linux driver you
are using for reference will probably retard rather than expedite your
progress: except for a small number of hardware details, the designs must be
so radically different that it is likely a distraction.

re your specific questions

  • mapping a 256 MB BAR is infeasible except in embedded situations.
    Normally, these resources are either mapped in their entirety or not at all
  • a device with no DMA greatly simplifies the issues with shared memory, as
    you don’t have to synchronize between hardware and software, but short of
    mapping device memory into UM, an obvious and horrible security issue in
    anything except embedded systems, there is no performance gain. The key
    point to realize here is that your system of synchronization / signaling
    between application and driver / hardware, if correct, will incur the
    exact same overhead as the OS-provided mechanism (IOCTL) except in a few
    very specific cases. And from an economic point of view, Microsoft has
    already invested man-years (decades?) into coding and testing the in-box
    logic that you would have to replicate
  • without DMA support, physical contiguity is irrelevant, so you have no
    need to do any stitching. Physical contiguity is only relevant when a
    device that does not reference the page tables (not a CPU) makes memory
    modifications or reads. Also, if there is no DMA, then non-cached makes no
    sense, because the CPU(s) will implement a cache coherency protocol and
    stay in sync. The only need for non-cached memory is if the device will
    read or update memory directly, and DMA stands for Direct Memory Access.

I suspect that when you resolve the above, an answer to your performance
question will become self evident.


Thanks for your inputs!

Agreed, the device was not designed with windows OS in mind. In fact the reference design from TI does not include any notion of windows drivers; the reference hardware platform simply runs linux on the ARM core and connects to various external peripheral devices. This SoC was very much designed for an embedded market. The PCIe interface seems to be entirely a marketing afterthought.

In our custom application (not Windows/DirectX video), we are using the DM8168 as a multi-channel surveillance video recorder that streams 16 channels of compressed H.264 video at 30 fps over the PCIe bus to a host OS. The host can be Linux or Windows. The compressed frames are saved to disk and re-sent to remote clients if needed. The Linux reference design from TI includes a memory-mapped driver and Video-4-Linux (V4L) layer that is functional. So of course this reference design gets propagated to Windows for our case now…

The DM8168 SoC supports the notion of ‘outbound transfers’ whereby it can internally DMA video buffers to a specific range of addresses within its PCIe subsystem (PCIESS). The PCIESS has some configuration registers that can be set up with physical addresses from the host OS that correspond to these ‘outbound’ areas. If the PCIESS observes a memory write occurring within any of these outbound ranges, it will propagate the write over the PCIe bus to the configured host physical address. So this is essentially an ongoing series of DMA from the device to the host, but not synchronized in any way.

There are 32 of these outbound areas available in the PCIESS hardware, and they can each be configured to map to a different physical address on the host. However, all 32 outbound window sizes must be the same: 1,2,4, or 8MB for each. So that is where the 256MB memory resource descriptor comes from (which is impossible to map as you mentioned). In addition, there are hardware restrictions imposed on mapping these PCIESS outbound areas to host physical memory: Each host physical address must be 1MB aligned, and each window must point to contiguous physical memory (much like DMA).

So on the host side this becomes shared memory. In the windows driver I must allocate chunks of physically contiguous, 1M-aligned memory to satisfy the hardware restriction. Of course I found early on that it’s impossible to get it all together as one chunk; I have to break up the physical allocations into smaller MDLs so they have a better chance of succeeding. The user space application wants to see this space as virtually contiguous, so I must recombine them into one MDL before mapping to user space. The synchronization lock between host and device is implemented in user space. This is a Peterson’s algorithm 2-process lock, and of course the lock data structures reside inside this shared area.
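
For context, the lock is essentially the textbook Peterson construction; roughly like the sketch below (illustrative only; our real structure layout and barrier handling differ, and party 0 is the host application, party 1 the device firmware):

// User-mode side of the shared region.
#include <windows.h>

typedef struct _SHARED_LOCK {
    volatile LONG flag[2];   // flag[i] != 0: party i wants the lock
    volatile LONG turn;      // which party yields when both want it
} SHARED_LOCK;

void shared_lock_acquire(SHARED_LOCK *lock, int self)
{
    int other = 1 - self;
    lock->flag[self] = 1;
    lock->turn = other;
    MemoryBarrier();                 // publish flag/turn before polling
    while (lock->flag[other] && lock->turn == other) {
        YieldProcessor();            // busy-wait politely
    }
}

void shared_lock_release(SHARED_LOCK *lock, int self)
{
    MemoryBarrier();                 // drain our writes to the shared data
    lock->flag[self] = 0;
}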

MM, you mentioned that “Physical contiguity is only relevant when a device that does not reference the page tables (not a CPU) makes memory modifications or reads.” This is indeed the case here as far as I can tell. If one temporarily disregards the locking scheme, it means that the device has direct access to modify physical memory on the host, which is why the shared memory is declared as non-cached.

AG, you mentioned that “If it’s not using DMA, then it doesn’t really need contiguous physical memory.
If it really needs it, then it’s actually using DMA.” This is in essence what is happening. The device is a bus master, so DMA must be happening here. But I have not set up my driver with a DMA Adapter extension and I am not creating or initializing any kind of DMA transactions. I have not explored that area of WDF drivers yet.

AG also mentioned “You CANNOT JUST COPY the PFN array. Create an MDL for the non-paged address (IoAllocateMdl/MmBuildMdlForNonPagedPool) and have it mapped to user mode.” This is indeed possible for each outbound window, but is it possible to join them into a contiguous virtual region in user space? As I mentioned in my first post, copying the PFN tables from the smaller MDLs into the larger one seems to work, but I fear it is treading on thin ice… I don’t know why yet. I keep the smaller MDLs in the deviceContext so that they all remain locked down and their lifetime is guaranteed. The large MDL that contains the copied PFN arrays is mapped with MmMapLockedPagesSpecifyCache before giving it to user space. MmMapLockedPagesSpecifyCache requires a PMDL as input, which “…must describe physical pages that are locked down”, which I have already done with each smaller MDL. So I don’t understand yet why copying the PFN array is a bad thing, if it is already locked and the original locking MDL does not go out of scope.

>it means that the device has direct access to modify physical memory on the host, which is why the shared memory is declared as non-cached.

No. In IA32, you don’t need DMA memory to be non-cached.

If you want 1MB contiguous chunks, you need to put that to SkipBytes, and specify the total size desired (32MB). You’ll get memory which is guaranteed to have 1MB contiguous chunks, in one MDL which you can map later to the contiguous usermode address.
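
Roughly like this (sketch only; sizes per the 32 x 1 MB case above, and the caching choice follows the earlier point that non-cached is unnecessary):

PHYSICAL_ADDRESS low  = { 0 };
PHYSICAL_ADDRESS high = { 0 };
PHYSICAL_ADDRESS skip = { 0 };

high.QuadPart = -1;                 // ~0: no upper limit on physical address
skip.QuadPart = 1 * 1024 * 1024;    // chunk size: 1 MB, physically contiguous

PMDL pHostMdl = MmAllocatePagesForMdlEx(
    low,
    high,
    skip,                           // SkipBytes = contiguous chunk size
    32 * 1024 * 1024,               // TotalBytes = all 32 windows in one MDL
    MmCached,                       // non-cached is not needed (see above)
    MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS);

// Later, MmMapLockedPagesSpecifyCache(pHostMdl, UserMode, MmCached, NULL,
// FALSE, NormalPagePriority) gives one virtually contiguous user-mode view.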

>So I don’t understand yet why copying the PFN array is a bad thing, if it is already locked and the original locking MDL does not go out of scope.

It may mess up the lock count.

>(AA) “If you want 1MB contiguous chunks, you need to put that to SkipBytes, and
specify the total size desired (32MB). You’ll get memory which is guaranteed to
have 1MB contiguous chunks, in one MDL which you can map later to the contiguous
usermode address.”

This is a good suggestion, better than my PFN array copy method-- but how can I get the individual physical addresses that I need to configure the PCIESS outbound mapping registers? Is it just as simple as MmGetPhysicalAddress(baseVaPtr + outboundSize*outboundIndex)? The msdn doc says “Do not use this routine to obtain physical addresses for use with DMA operations.” So I had stayed away from using it.

>how can I get the individual physical addresses that I need to configure the PCIESS outbound mapping registers?

You can use good old GetScatterGatherList.
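
Roughly like the sketch below (illustrative only; the DEVICE_DESCRIPTION fields, the callback, and the wrapper function are placeholders, and the list handed to the callback should eventually be returned with PutScatterGatherList):

// Called back at DISPATCH_LEVEL with the physical runs of the MDL.
VOID ProgramOutboundWindows(
    PDEVICE_OBJECT DeviceObject, PIRP Irp,
    PSCATTER_GATHER_LIST SgList, PVOID Context)
{
    UNREFERENCED_PARAMETER(DeviceObject);
    UNREFERENCED_PARAMETER(Irp);
    UNREFERENCED_PARAMETER(Context);

    for (ULONG i = 0; i < SgList->NumberOfElements; i++) {
        // SgList->Elements[i].Address and .Length are exactly what the
        // PCIESS outbound mapping registers need; write them here.
    }
}

NTSTATUS GetOutboundPhysicalRuns(
    PDEVICE_OBJECT Fdo, PDEVICE_OBJECT Pdo, PMDL HostMdl, PVOID DevContext)
{
    DEVICE_DESCRIPTION dd;
    ULONG nMapRegs;
    PDMA_ADAPTER adapter;
    KIRQL oldIrql;
    NTSTATUS status;

    RtlZeroMemory(&dd, sizeof(dd));
    dd.Version           = DEVICE_DESCRIPTION_VERSION;
    dd.Master            = TRUE;
    dd.ScatterGather     = TRUE;
    dd.Dma64BitAddresses = TRUE;
    dd.InterfaceType     = PCIBus;
    dd.MaximumLength     = 32 * 1024 * 1024;

    adapter = IoGetDmaAdapter(Pdo, &dd, &nMapRegs);
    if (adapter == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    // GetScatterGatherList must be called at DISPATCH_LEVEL.
    KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);
    status = adapter->DmaOperations->GetScatterGatherList(
        adapter, Fdo, HostMdl,
        MmGetMdlVirtualAddress(HostMdl), MmGetMdlByteCount(HostMdl),
        ProgramOutboundWindows, DevContext,
        FALSE);              // device writes into host memory
    KeLowerIrql(oldIrql);

    return status;
}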

I would recommend that you at least point out that this is an inherently
unsafe design for a non-embedded environment. Direct synchronization
between a UM app and a device might seem fine if you have complete
control, but when zero or 42 instances of that app are running as multiple
users (terminal services / fast user switching) it doesn’t seem so good.
Even if the UM process is a service, and so multiple access is handled for
you by the SCM, there is nothing to prevent malware from attacking the KM
interface (your ACL may be replaced by a sufficiently privileged attacker).
Even if we are only discussing a buggy UM app, the possibility of screwing up
hardware from UM runs counter to the basic design principles of all modern
general purpose operating systems - and lots of other ones too.

I recommend that you build a state machine in your driver to handle the
oddities of your particular device (DMA with nonstandard sync) and then
present an interface to your application. If it has been decided that this
must be shared memory, at least do the synchronization between the device and
the driver, and then separately between the driver and the application.

I think you have a long road ahead on this project.
