This is a very confusing question. I think it is confusing because you
are confused.
You never, ever pass a virtual pointer to a DMA device (there is no “the”
DMA engine; each DMA device is a bus master, thus there are as many “DMA
engines” as there are devices).
A pointer + length spans many pages; the pointer is just a pointer, the
address of the first byte of the buffer. Each virtual page in that range
maps 1:1 onto some physical page frame.
The problem is that if you pass in a 1GB buffer this requires, using
Direct I/O, that 1GB of memory will be locked down. You may not have
enough physical memory to accomplish this. There is a solution, but first
there are a few more pieces of confusion to clear up.
Whenever you allocate pages with VirtualAlloc, it doesn’t really matter
whether the pages are regular 4K pages or something else; a 1GB buffer will
consume 1GB of memory when it is locked down, no matter how large the
“pages” are. Page size at that point is irrelevant. The memory is locked
not when you call VirtualAlloc, but when the I/O Manager has processed the
request. As part of its action, it creates an MDL, then calls
MmProbeAndLockPages to lock all the pages down. Two things immediately
follow: (1) if there is not enough physical memory available to lock the
buffer down, the I/O Manager will fail with an “insufficient kernel
resources” error, and (2) if it succeeds, you have consumed a lot of
memory that will stay locked until the I/O completes. Furthermore, as a
consequence of (2), some significant number of other pages must be either
paged out (if they are writable and have been modified) or have their
frames freed and the pages marked for demand faulting (if they are
read-only, or are read/write and have not been modified). This is a
potentially serious
impact on overall system performance, particularly if the percentage of
pages (of any size) that are locked down is a significant percentage of
the total available memory.
There is a solution: Neither I/O.
In this mode, all you get is the UserBuffer address and the length.
Period. You have to create the MDL. However, what you do is defer the
MmProbeAndLockPages, so the pages remain pageable until you need them.
When you are doing DMA, one of the parameters that comes back to you is
the number of (4K) address chunks that the OS is willing to allow you to
use for your DMA transaction. There are many parameters that determine
this value: the total available space; whether you have a 32-bit-address
card on a bus with 36-bit-or-wider physical addressing, in which case the
limit may be set by the number of DMA mapping registers available to
translate 32-bit DMA addresses into wider physical addresses; and whether
the OS has to copy your data that lies above the 32-bit physical address
limit into or out of “bounce buffers”, buffers that are allocated below
the 4GB physical address boundary.
Once you know this value, what you do is create a partial MDL based on the
input MDL you created, and do the MmProbeAndLockPages yourself (always
being prepared for failure, because this is the point at which the
validity of the user buffer is determined). You then do DMA on the
partial MDL. When the DMA completes, you release the locked pages and
advance the partial MDL to the next block of addresses to write out;
lather, rinse, repeat.
Sounds simple enough. But there’s a real clinker here: the post-DMA
completion is almost certainly executed at DISPATCH_LEVEL, and
MmProbeAndLockPages (MmPALP) is going to bluescreen if it has to bring a
page in. So you have to farm out the MmPALP to a PASSIVE_LEVEL thread
that does the MmPALP, properly handles any errors that may occur, and
initiates the next transfer. Ouch!
BUT: Tony Mason wrote an article in the OSR Newsletter sometime in the
recent past (the last two years for me have been compressed in various
ways, and I lose track of whether it was three months back or eighteen
months back) explaining all this. You need to read it. It can be found
in the OSR archives.
Note that there is another solution. Don’t present your user with
ReadFile and WriteFile as the I/O primitives. Present your user with a
DLL you wrote. The DLL works along these lines: attempt to write the
whole buffer out. If that fails because of insufficient resources, issue
two writes for half-a-buffer. If the first of these fails, it means you
will need, in principle, four writes for a quarter-buffer. Subdivide
until it works. Note that you may, /a priori/, decide that you will never
write a chunk > 100MB or something like that. The advantage of this
approach is that it is basically very easy to write, compared to a driver
that can handle arbitrarily large buffers. Its performance is slightly
worse because you may have those failed writes at the start of the
transaction, but you can decide that once a proper fractional value is
determined, all subsequent writes will start with that first “best guess”.
So you don’t do this for EVERY write, just the first few attempts until
you get to a size that doesn’t fail. Your control of memory pressure is
how many bytes you limit each transaction to (in the DLL) in order to
satisfy the user’s request. And yes, it has a few extra round trips
kernel-and-back, but these are amazingly fast compared to the cost of a
single page fault so they ultimately don’t matter…
I said “write”, but “read” is handled symmetrically. I didn’t feel like
going back and rewording the description to say “transfer” or “read/write”
everywhere; doing this is Left As An Exercise For The Reader [LAAEFTR].
Basically, VirtualAlloc is going to allocate as many bytes as you request
(or fail because there is no space left, or not enough contiguous slots in
your process’s memory map). You can’t “encourage” it to allocate fewer
bytes than the request, because, by definition, all those bytes are
mapped. You can’t have “fewer mapped buffers” because that concept does
not even exist. You call VirtualAlloc and ask for a bunch-o-bytes. It
chooses a virtual address in your process map that is defined as the
starting address of the buffer (and, by the way, it will be a page-aligned
address; more than that, it will be aligned to the “allocation
granularity”, which is nominally 64K; there is an API, GetSystemInfo, that
tells you this value, and it has been 64K for decades, but it could change
tomorrow).
There must be sufficient contiguous slots in the process’s memory map to
span from the starting address for the length. If the request cannot be
satisfied, VirtualAlloc returns NULL. But if the request IS satisfied,
those pages are NOT locked down in any way; they are completely pageable
until either the I/O Manager (for Direct I/O) or your driver (for Neither
I/O) lock down the pages. At that point, they become brevet
non-paged-pool pages until the transaction (or, as I describe above, each
partial transaction) completes. Then they are once more pageable. The
problem arises if there is insufficient physical memory to lock the buffer
down; for example, using a 1GB user buffer on a 1GB machine; any attempt
to lock down the entire buffer will fail, even though, logically, the user
can access the entire 1GB.
Note also that any attempt to allocate a massive buffer in your app can
fail, because even though there are potentially enough page frames in
memory to hold the data (even if some of them are residing temporarily in
pagefile.sys), there is not enough available address space in your process
to satisfy the request. This is much less an issue on Win64 machines
(even if it is a Win32 app), but it is a common problem in Win32 apps
running on a Win32 machine. Note that to take advantage of the extended
user space available to a Win32 app running on either a 3GB-user-partition
Win32 system [which is not the default for client machines] or under
WoW64, you must link it with the /LARGEADDRESSAWARE linker option.
Otherwise, it will be limited to 2GB of available process space. [I think
I remembered the name of that option correctly, but that’s why there are
manuals…once you have the core idea, the rest is just syntax, and that
is LAAEFTR]
I think you need to review the concepts of how virtual memory works. You
need to understand how memory maps work, how page-locking works, and how
DMA works. From the question, I have formed the impression that you do
not understand these very well, if at all.
joe
Hi,
A virtual pointer is passed to the dma engine. If the pointer spans many
physical pages, then the mapped buffers may be 1-to-1 for each page. Is it
possible to encourage the memory manager to allocate fewer mapped buffers?
In Win32 API, VirtualAlloc(MEM_LARGE_PAGES) gives mapped buffers of at
least 2MB in size. The downside is that the memory is locked. For this
particular project, it is useful to allow huge application buffers (> 1GB)
to be swapped and have large dma buffers at the right time.
Thank you,
Calin
NTDEV is sponsored by OSR
Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev