This is a very confusing question. I think it is confusing because you
are confused.
You never, ever pass a virtual pointer to a DMA device (there is no “the”
DMA engine; each DMA device is a bus master, thus there are as many “DMA
engines” as there are devices).
A pointer + length spans many pages; the pointer is just a pointer, the
address of the first byte of the buffer. Each virtual page in that range
maps 1:1 onto some physical page frame.
The problem is that if you pass in a 1GB buffer this requires, using
Direct I/O, that 1GB of memory will be locked down. You may not have
enough physical memory to accomplish this. There is a solution, but first
there are a few more pieces of confusion to clear up.
Whenever you allocate pages with VirtualAlloc, it doesn’t really matter
whether the pages are regular 4K pages or something else; a 1GB buffer will
consume 1GB of memory when it is locked down, no matter how large the
“pages” are. Page size at that point is irrelevant. The memory is locked
not when you call VirtualAlloc, but when the I/O Manager has processed the
request. As part of its action, it creates an MDL, then calls
MmProbeAndLockPages to lock all the pages down. Two things immediately
follow: (1) if there is not enough physical memory available to lock the
buffer down, the I/O Manager will fail with an “insufficient kernel
resources” error, and (2) if it succeeds, you have consumed a lot of
memory that will stay locked until the I/O completes. Furthermore, as a
consequence of (2), some significant number of other pages must be either
paged out (if they are writable and have been modified) or have their
frames freed and the pages marked for demand faulting (if they are
read-only, or are read/write and have not been modified). This is a
potentially serious
impact on overall system performance, particularly if the percentage of
pages (of any size) that are locked down is a significant percentage of
the total available memory.
There is a solution: Neither I/O.
In this mode, all you get is the UserBuffer address and the length.
Period. You have to create the MDL. However, what you do is defer the
MmProbeAndLockPages, so the pages remain pageable until you need them.
When you are doing DMA, one of the parameters that comes back to you is
the number of (4K) address chunks that the OS is willing to allow you to
use for your DMA transaction. There are many parameters that determine
this value: the total available space; whether you have a 32-bit-address
card on a bus with 36-bit-or-wider physical addressing, in which case the
limit may be set by the number of DMA mapping registers available to
translate 32-bit DMA addresses into wider physical addresses; and whether
the OS has to copy your data that lies above the 32-bit physical address
limit into or out of “bounce buffers”, buffers that are allocated below
the 4GB physical address boundary.
Once you know this value, what you do is create a partial MDL based on the
input MDL you created, and do the MmProbeAndLockPages yourself (always
being prepared for failure, because this is the point at which the
validity of the user buffer is determined). You then do DMA on the
partial MDL. When the DMA completes, you release the locked pages and
advance the partial MDL to the next block of addresses to write out;
lather, rinse, repeat.
Sounds simple enough. But there’s a real clinker here: the post-DMA
completion is almost certainly executed at DISPATCH_LEVEL, and
MmProbeAndLockPages (MmPALP) is going to bluescreen if it has to bring a
page in. So you have to farm out the MmPALP to a PASSIVE_LEVEL thread
that does the MmPALP, properly handles any errors that may occur, and
initiates the next transfer. Ouch!
BUT: Tony Mason wrote an article in the OSR Newsletter sometime in the
recent past (the last two years for me have been compressed in various
ways, and I lose track of whether it was three months back or eighteen
months back) explaining all this. You need to read it. It can be found
in the OSR archives.
Note that there is another solution. Don’t present your user with
ReadFile and WriteFile as the I/O primitives. Present your user with a
DLL you wrote. The DLL works along these lines: attempt to write the
whole buffer out. If that fails because of insufficient resources, issue
two writes for half-a-buffer. If the first of these fails, it means you
will need, in principle, four writes for a quarter-buffer. Subdivide
until it works. Note that you may, /a priori/, decide that you will never
write a chunk > 100MB or something like that. The advantage of this
approach is that it is basically very easy to write, compared to a driver
that can handle arbitrarily large buffers. Its performance is slightly
worse because you may have those failed writes at the start of the
transaction, but you can decide that once a proper fractional value is
determined, all subsequent writes will start with that first “best guess”.
So you don’t do this for EVERY write, just the first few attempts until
you get to a size that doesn’t fail. Your control of memory pressure is
how many bytes you limit each transaction to (in the DLL) in order to
satisfy the user’s request. And yes, it has a few extra round trips
kernel-and-back, but these are amazingly fast compared to the cost of a
single page fault so they ultimately don’t matter…
I said “write”, but “read” is handled symmetrically. I didn’t feel like
going back and rewording the description to say “transfer” or “read/write”
everywhere; doing this is Left As An Exercise For The Reader [LAAEFTR].
Basically, VirtualAlloc is going to allocate as many bytes as you request
(or fail because there is no space left, or not enough contiguous slots in
your process’s memory map). You can’t “encourage” it to allocate fewer
bytes than the request, because, by definition, all those bytes are
mapped. You can’t have “fewer mapped buffers” because that concept does
not even exist. You call VirtualAlloc and ask for a bunch-o-bytes. It
chooses a virtual address in your process map that is defined as the
starting address of the buffer (and, by the way, it will be a page-aligned
address; more than that, it will be aligned to the “allocation
granularity”, which is nominally 64K; there is an API, GetSystemInfo, that
tells you this value, and it has been 64K for decades, but it could change
tomorrow).
There must be sufficient contiguous slots in the process’s memory map to
span from the starting address for the length. If the request cannot be
satisfied, VirtualAlloc returns NULL. But if the request IS satisfied,
those pages are NOT locked down in any way; they are completely pageable
until either the I/O Manager (for Direct I/O) or your driver (for Neither
I/O) lock down the pages. At that point, they become brevet
non-paged-pool pages until the transaction (or, as I describe above, each
partial transaction) completes. Then they are once more pageable. The
problem arises if there is insufficient physical memory to lock the buffer
down; for example, using a 1GB user buffer on a 1GB machine; any attempt
to lock down the entire buffer will fail, even though, logically, the user
can access the entire 1GB.
Note also that any attempt to allocate a massive buffer in your app can
fail, because even though there are potentially enough page frames in
memory to hold the data (even if some of them are residing temporarily in
pagefile.sys), there is not enough available address space in your process
to satisfy the request. This is much less an issue on Win64 machines
(even if it is a Win32 app), but it is a common problem in Win32 apps
running on a Win32 machine. Note that to take advantage of the extended
user space available to a Win32 app running on either a 3GB-user-partition
Win32 system [which is not the default for client machines] or under
WoW64, you must link it with the /LARGEADDRESSAWARE linker option.
Otherwise, it will be limited to 2GB of available process space. [I think
I remembered the name of that option correctly, but that’s why there are
manuals…once you have the core idea, the rest is just syntax, and that
is LAAEFTR]
I think you need to review the concepts of how virtual memory works. You
need to understand how memory maps work, how page-locking works, and how
DMA works. From the question, I have formed the impression that you do
not understand these very well, if at all.
joe
Hi,
A virtual pointer is passed to the dma engine. If the pointer spans many
physical pages, then the mapped buffers may be 1-to-1 for each page. Is it
possible to encourage the memory manager to allocate fewer mapped buffers?
In Win32 API, VirtualAlloc(MEM_LARGE_PAGES) gives mapped buffers of at
least 2MB in size. The downside is that the memory is locked. For this
particular project, it is useful to allow huge application buffers (> 1GB)
to be swapped and have large dma buffers at the right time.
Thank you,
Calin
NTDEV is sponsored by OSR
Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev