Hello ntdev list:
I am a first-time Windows driver dev, seeking help from experienced driver writers such as yourselves.
I am also newly subscribed to NTDEV, and this is my first question after several months of perusing the list archives.
My question is about bandwidth performance of memory-mapped I/O regions between device and host, and whether
I am setting these up correctly for my application. I understand that “Mapping kernel memory to user space
is a complex and error-prone technique that in general should be ignored as a possible solution” but that was
the design decision that flowed down to me. I hope this is not too much tl;dr …
Driver project background:
- A custom PCIe analog video capture board based on TI DM8168.
- Using KMDF/VS2012 to develop the Windows driver. A Linux driver is available as a reference design.
- Ring buffers containing video data are mapped to shared user memory on the host; we are not using DMA.
- The board PCIe configuration supports the concept of mapping host physical addresses as inbound and outbound.
- Inbound memory resides on the board; outbound memory resides on the host.
- Mapped regions must be contiguous and 1 MB aligned.
- There are 32 available outbound mapping slots, each configurable to 1, 2, 4, or 8 MB. We are setting them all to 1 MB and combining them in userland.
Using the MSDN WDF online documentation and sample driver code, I was able to put together a functional Windows
driver that can load firmware (Linux U-Boot + kernel + filesystem) to the device and boot it. The resource list
presented by the PnP manager consists of 4 BARs:
BAR0 : 4 KB, PCI config space
BAR1 : 8 MB, memory
BAR2 : 16 MB, memory
BAR3 : 256 MB, memory
I am able to use MmMapIoSpace on BARs 0-2, but mapping BAR3 usually fails with insufficient resources, so I don't map it.
I would like to use a portion of BAR3 as host memory (outbound for the device), but I don't know how to do that
(my understanding is that MmMapIoSpace does not produce system virtual addresses that correspond to host physical memory).
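For context, the BAR mapping is done in my EvtDevicePrepareHardware callback, roughly like the sketch below. This is
simplified: the device-context accessor and the BarVa/BarLen bookkeeping names are just illustrative, and the real
code checks every return value.

NTSTATUS EvtDevicePrepareHardware(
    WDFDEVICE Device,
    WDFCMRESLIST ResourcesRaw,
    WDFCMRESLIST ResourcesTranslated
    )
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(Device); // illustrative context accessor
    ULONG bar = 0;
    UNREFERENCED_PARAMETER(ResourcesRaw);
    for (ULONG i = 0; i < WdfCmResourceListGetCount(ResourcesTranslated); i++) {
        PCM_PARTIAL_RESOURCE_DESCRIPTOR desc =
            WdfCmResourceListGetDescriptor(ResourcesTranslated, i);
        if (desc->Type != CmResourceTypeMemory) {
            continue; // skip interrupt/port resources
        }
        // Map the translated BAR into system virtual address space.
        ctx->BarVa[bar]  = MmMapIoSpace(desc->u.Memory.Start,
                                        desc->u.Memory.Length,
                                        MmNonCached);
        ctx->BarLen[bar] = desc->u.Memory.Length;
        bar++;
    }
    return STATUS_SUCCESS;
}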
Getting the outbound video streams is another matter. In the Linux reference driver, this is done by allocating numerous
contiguous 1 MB physical chunks (1 MB aligned) and stitching them together into a single virtual address range. It is done
this way because obtaining one larger contiguous, aligned section of physical memory becomes difficult once physical memory
is fragmented, for example after the driver has been reloaded many times. Also, each outbound mapping register on the
device requires a host physical address.
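In case it is useful to see what the device side needs: the host physical address for each 1 MB chunk comes straight
from that chunk's MDL and is then written into the corresponding outbound window register, roughly as sketched below.
The register offset, the 8-byte stride, and the regsVa name are hypothetical here; the real DM8168 outbound register
layout is board-specific.

// pMdl describes one physically contiguous 1 MB chunk (see the allocation loop below);
// index is that chunk's slot number; regsVa is the kernel mapping of the BAR that
// exposes the outbound window registers. OUTBOUND_WIN_BASE is a made-up offset.
PPFN_NUMBER pfns = MmGetMdlPfnArray(pMdl);
PHYSICAL_ADDRESS chunkPa;
chunkPa.QuadPart = (ULONGLONG)pfns[0] << PAGE_SHIFT; // base address of the 1 MB chunk

WRITE_REGISTER_ULONG((PULONG)((PUCHAR)regsVa + OUTBOUND_WIN_BASE + index * 8),
                     chunkPa.LowPart);
WRITE_REGISTER_ULONG((PULONG)((PUCHAR)regsVa + OUTBOUND_WIN_BASE + index * 8 + 4),
                     (ULONG)chunkPa.HighPart);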
In my Windows driver I tried something similar, like this (error handling omitted for clarity):
const SIZE_T pagesPerHostBuffer = BYTES_TO_PAGES(1*1024*1024);
const SIZE_T pfnSize = sizeof(PFN_NUMBER) * pagesPerHostBuffer;
// host_num_buffers is the number of 1 MB chunks; hostMdlSize (used below) covers all of them.
const SIZE_T hostMdlSize = host_num_buffers * (1*1024*1024);
// Allocate large MDL which will be mapped to user space
PMDL pHostMdl = IoAllocateMdl(
NULL, // This MDL will not have any associated system virtual address.
hostMdlSize, // Comprises the entire group of host memory regions.
FALSE, // Not associated with Irp, not part of an MDL chain.
FALSE, // ChargeQuota always FALSE
NULL // No associated Irp
);
// Allocate and map smaller MDL aligned buffers
for (ULONG index = 0; index < host_num_buffers; index++) {
PMDL pMdl = MmAllocatePagesForMdlEx(
LowAddress, // 0
HighAddress, // ~0
SkipBytes, // 1*1024*1024 (each contiguous chunk is 1 MB)
1*1024*1024, // TotalBytes: one 1 MB chunk per call
MmNonCached,
MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS
);
PVOID pVirt = MmMapLockedPagesSpecifyCache(
pMdl,
KernelMode,
MmNonCached,
NULL,
FALSE,
HighPagePriority
);
// the source of the physical page data within the small MDL …
PPFN_NUMBER pSrc = MmGetMdlPfnArray(pMdl);
// … will be copied in order to the single large MDL.
PPFN_NUMBER pDst = &MmGetMdlPfnArray(pHostMdl)[index * pagesPerHostBuffer];
RtlCopyMemory(pDst, pSrc, pfnSize);
pHostMdl->MdlFlags = pMdl->MdlFlags;
}
// Finally, Map the large FrankenMDL to user space
PVOID user_addr = MmMapLockedPagesSpecifyCache(
pHostMdl,
UserMode,
MmNonCached,
NULL, // Allow the system to choose the user-mode starting address
FALSE, // Disable BugCheckOnFailure since the mapping is to user mode
HighPagePriority // Indicates the importance of success when PTEs are scarce.
);
This method (a kind of pseudo scatter-gather) seems to work, in the sense that a) it doesn't crash and b) board
data reaches the host, but it feels like a kludge and I suspect I have just been lucky so far. In addition, the
performance is miserable compared to the Linux driver: roughly 1/10th the transfer rate in a simple, non-optimized
memcpy test on the same underlying hardware. For reference, the Linux version also declares all of its mappings as
noncached. I realize this all calls for a redesign using DMA, but it's very unlikely to turn out that way.
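By "simple non-optimized memcpy test" I mean nothing fancier than timing a large copy out of the mapped region from
user mode, along the lines of the sketch below; the region size, function, and buffer names are illustrative.

#include <windows.h>
#include <string.h>

#define REGION_BYTES (32u * 1024u * 1024u)  // size of the combined mapped region (illustrative)

// Copies the mapped region into a local buffer 'iterations' times and returns MB/s.
double MeasureCopyRate(const void *mapped, void *local, unsigned iterations)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (unsigned i = 0; i < iterations; i++) {
        memcpy(local, mapped, REGION_BYTES);
    }
    QueryPerformanceCounter(&t1);
    double seconds = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
    return ((double)REGION_BYTES * iterations) / (1024.0 * 1024.0) / seconds;
}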
I would appreciate any suggestions on how to structure this better (without having to involve DMA),
or any hints on how to get better performance.
Thanks in advance.