Concatenate a large number of physical memory buffers into one large virtual buffer

Hi

I am working with a driver that needs to handle a large amount of data very fast.
It is an embedded system, so security is not an issue, but speed is.

Our adapter does DMA (it does not support scatter/gather) in 4 MB chunks,
but for the user-mode program to be able to handle the data fast enough,
I need to concatenate these 4 MB physically contiguous chunks into one large
virtually contiguous memory block. We have to support a virtual buffer of up to 64 GB.

What I am doing until now is allocating a virtual buffer using
ZwAllocateVirtualMemory in order to find an empty area. Then I free the memory
again. Now I have the start address of a free memory area. Using this start
address and the calls IoAllocateMdl, MmBuildMdlForNonPagedPool and
MmMapLockedPagesSpecifyCache, I am able to map the physical chunks into
a contiguous virtual memory buffer.

The problem with doing this is that I cannot be sure that the memory area
remains free. It works fine for buffers up to 2 GB, but for larger
buffers it fails a lot.

Here is an example of my code:

totalSize = 0;
for (i = 0; i < pMmap->count; i++) {
    totalSize += pMmap->map[i].bufferSize;
}

// Reserve memory to find a free virtual address range
nextAddress = NULL;
if (pMmap->count > 1) {
    reservedSize = totalSize;
    status = ZwAllocateVirtualMemory(NtCurrentProcess(), &nextAddress, 0L, &reservedSize, MEM_RESERVE, PAGE_NOACCESS);
    if (STATUS_SUCCESS != status) {
        return status;
    }
    // Free it again; we only wanted the start address of a free area
    ZwFreeVirtualMemory(NtCurrentProcess(), &nextAddress, &reservedSize, MEM_RELEASE);
}

for (i = 0; i < pMmap->count; i++) {
    pMdl = IoAllocateMdl((PVOID)pMmap->map[i].driver.u.dataAddress, (ULONG)pMmap->map[i].bufferSize, FALSE, FALSE, NULL);
    if (pMdl == NULL) {
        log(ERR, "Error: IoAllocateMdl failed\n");
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    MmBuildMdlForNonPagedPool(pMdl);

    __try {
        pMmap->map[i].user.u.pVirt = MmMapLockedPagesSpecifyCache(pMdl, UserMode, MmCached, nextAddress, FALSE, NormalPagePriority);
        pMmap->map[i].base.u.dataAddress = (uint64_t)pMdl;
    }
    __except (EXCEPTION_EXECUTE_HANDLER) {
        status = GetExceptionCode();
        IoFreeMdl(pMdl);
        return status;
    }

    // Map complete...
    if (pMmap->count > 1) {
        // If we have more than one mapping, we need to be sure the mapping is contiguous.
        if (nextAddress != pMmap->map[i].user.u.pVirt) {
            // We did not get the wanted address range. Free this mapping;
            // the other mappings will be freed by the fallback.
            MmUnmapLockedPages(pMmap->map[i].user.u.pVirt, pMdl);
            IoFreeMdl(pMdl);
            return STATUS_INSUFFICIENT_RESOURCES;
        }
    }
    nextAddress = (PCHAR)nextAddress + pMmap->map[i].bufferSize;
}

Does anyone know a better (right) way to do this? I thought about using
a chained MDL, but MmMapLockedPagesSpecifyCache does not support that.

Regards
Bent

xxxxx@napatech.com wrote:

Our adapter does DMA (it does not support scatter/gather) in 4 MB chunks,
but for the user-mode program to be able to handle the data fast enough,
I need to concatenate these 4 MB physically contiguous chunks into one large
virtually contiguous memory block. We have to support a virtual buffer of up to 64 GB.

What I am doing until now is allocating a virtual buffer using
ZwAllocateVirtualMemory in order to find an empty area. Then I free the memory
again. Now I have the start address of a free memory area.

That is just nuts.

The requirement that your huge buffer be virtually contiguous is silly.
The performance difference between accessing a 64GB virtual region as
one continuous block vs 16,000 chunks of 4MB is not measurable.

So, allocate your 16,000 physically contiguous 4MB buffers. Map them to
user space one at a time. Pass an array of pointers up to the app.
Problem solved.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Hi Tim

I do not need people to tell me that what I am trying to do is nuts. I need people to tell me whether or not it is possible to do what I want to do.

And I can tell you that the difference between accessing chunks of 4 MB buffers compared to accessing a contiguous buffer is quite measurable.

Regards
Bent

>And I can tell you that the difference between accessing chunks of 4 MB buffers
compared to accessing a contiguous buffer is quite measurable.

OK, what kind of access do you do to the buffer? Totally random, by slices, by sub-vectors, or as an array of structures? Are you running a giant FFT on sub-arrays?

> Our adapter does DMA (it does not support scatter/gather) in 4 MB chunks,

but for the user-mode program to be able to handle the data fast enough,
I need to concatenate these 4 MB physically contiguous chunks into one large
virtually contiguous memory block. We have to support a virtual buffer of up to 64 GB.

In your app, allocate the buffer using the usual VirtualAlloc.

Then subdivide it in a linear way into 4 MB chunks, and issue one overlapped ReadFile per 4 MB chunk.

Then collect the completions of these overlapped I/Os. At that moment, you have 64 GB of virtually contiguous data.

The driver does not require any updates. It only sees 4MB chunks.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com
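Maxim's approach could be sketched roughly like this in user mode. This is untested and illustrative only: the device handle, the serial completion handling (real code would keep many reads in flight), and the buffer sizes are assumptions, not the OP's actual code.

```c
#include <windows.h>

#define CHUNK_SIZE (4ULL * 1024 * 1024)            /* 4 MB per DMA transfer */
#define TOTAL_SIZE (64ULL * 1024 * 1024 * 1024)    /* 64 GB virtual buffer  */

BOOL CaptureIntoOneBuffer(HANDLE hDevice)
{
    /* One big virtually contiguous region; the driver never sees it whole. */
    BYTE *base = VirtualAlloc(NULL, TOTAL_SIZE,
                              MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (base == NULL)
        return FALSE;

    ULONGLONG chunks = TOTAL_SIZE / CHUNK_SIZE;
    for (ULONGLONG i = 0; i < chunks; i++) {
        OVERLAPPED ov = {0};
        ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
        if (ov.hEvent == NULL)
            return FALSE;

        /* Each read lands at its linear offset inside the big buffer. */
        if (!ReadFile(hDevice, base + i * CHUNK_SIZE, (DWORD)CHUNK_SIZE,
                      NULL, &ov)
            && GetLastError() != ERROR_IO_PENDING) {
            CloseHandle(ov.hEvent);
            return FALSE;
        }

        /* Simplified: wait for each read here. Real code would queue many
         * reads and collect completions via an I/O completion port. */
        DWORD done;
        GetOverlappedResult(hDevice, &ov, &done, TRUE);
        CloseHandle(ov.hEvent);
    }
    return TRUE;
}
```

The driver only ever sees 4 MB requests, exactly as Maxim describes; the contiguity exists purely in the application's address space.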

>Then subdivide it in a linear way into 4 MB chunks, and issue one overlapped ReadFile
per 4 MB chunk

The OP’s adapter doesn’t support SG.

Anyway, if he can live with 2MB chunks, he could use large page allocation for the region:

VirtualAlloc(NULL, 0x1000000000ULL, MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES, PAGE_READWRITE);

I am working with a network capture adapter. It has to be able to capture 40 Gb/s without any packet loss. To be able to do that, I am doing a lot of things in the driver that are not recommended by Microsoft. But it is an embedded system, so that is not a big problem. Speed is required. Even checking for packets that are split across the boundary from one 4 MB chunk to the next takes too long.

> VirtualAlloc(NULL, 0x1000000000ULL, MEM_COMMIT | MEM_RESERVE |MEM_LARGE_PAGES,
PAGE_READWRITE);

How can I allocate contiguous DMA memory with VirtualAlloc?

Is it possible? Everything is possible, Mr. Kuhre. We’re only talking about hardware, software, time and money after all. Usually, I’ve found that if something is inordinately difficult it’s because I’m doing something wrong.

But, be that as it may… let’s assume you really want to do what you’re saying.

Here’s the route I would try:

Reserve your enormous kernel-virtual address space using MmAllocateMappingAddress. Then, allocate the physically contiguous chunks using MmAllocatePagesForMdlEx. Note that this function does not default to returning contiguous memory, but you can set a flag to force it to return contiguous allocations. You’ll want to be highly permissive with the range of physical addresses you specify as being acceptable.

When your chunks are allocated, you can call MmMapLockedPagesWithReservedMapping to slot the physical chunks you allocated into your previously reserved KVA space.

I need to tell you I haven’t tried this myself. But, then again, given the problem as you’ve stated it I think it’s fair to say nobody has probably solved your exact problem.

/ And then he begins to scold the OP

While I hope the above helps, what I *really* suggest is that you revisit your design. Or talk with somebody who might have more insight into Windows system internals who can guide a review of your design. Maybe you’ve chosen the best alternative from among many bad options. I guess that’s possible. Oh, and get some hardware that supports S/G. Please.

Peter
OSR
@OSRDrivers
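For what it's worth, the route Peter describes might look roughly like this for a single 4 MB chunk. This is untried (as he says himself), all local names are illustrative, and a later reply in this thread points out that MmMapLockedPagesWithReservedMapping cannot map at an arbitrary offset inside the reserved range, which is exactly what stitching many chunks together would need.

```c
#include <ntddk.h>

#define POOL_TAG 'pamB'
#define CHUNK    (4ULL * 1024 * 1024)

NTSTATUS MapOneChunk(PVOID *MappingBase)
{
    PHYSICAL_ADDRESS low, high, skip;
    PMDL mdl;
    PVOID base, va;

    /* 1. Reserve a system-VA range up front. */
    base = MmAllocateMappingAddress(CHUNK, POOL_TAG);
    if (base == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    /* 2. Allocate physically contiguous pages. The flag forces contiguity;
     *    be permissive about the acceptable physical address range. */
    low.QuadPart  = 0;
    high.QuadPart = ~0ULL;
    skip.QuadPart = 0;
    mdl = MmAllocatePagesForMdlEx(low, high, skip, CHUNK, MmCached,
                                  MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS);
    if (mdl == NULL) {
        MmFreeMappingAddress(base, POOL_TAG);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    /* 3. Slot the pages into the reserved range. */
    va = MmMapLockedPagesWithReservedMapping(base, POOL_TAG, mdl, MmCached);
    if (va == NULL) {
        MmFreePagesFromMdl(mdl);
        ExFreePool(mdl);
        MmFreeMappingAddress(base, POOL_TAG);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    *MappingBase = va;
    return STATUS_SUCCESS;
}
```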

>How can I allocate contiguous DMA memory with VirtualAlloc?

MEM_LARGE_PAGES gives you 2MB physically contiguous pieces. I hope your hardware can handle 2MB chunks instead of 4MB.
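A minimal user-mode sketch of the large-page approach. It assumes the account holds the "Lock pages in memory" right (SeLockMemoryPrivilege), which MEM_LARGE_PAGES requires; the privilege-enabling boilerplate is generic, not specific to the OP's application.

```c
#include <windows.h>
#include <stdio.h>

/* Enable SeLockMemoryPrivilege on the current process token. */
static BOOL EnableLockMemoryPrivilege(void)
{
    HANDLE tok;
    TOKEN_PRIVILEGES tp;
    BOOL ok;

    if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &tok))
        return FALSE;

    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    LookupPrivilegeValue(NULL, SE_LOCK_MEMORY_NAME, &tp.Privileges[0].Luid);
    ok = AdjustTokenPrivileges(tok, FALSE, &tp, 0, NULL, NULL)
         && GetLastError() == ERROR_SUCCESS;
    CloseHandle(tok);
    return ok;
}

int main(void)
{
    if (!EnableLockMemoryPrivilege())
        return 1;

    /* Each large page is physically contiguous (2 MB on x64). */
    SIZE_T largePage = GetLargePageMinimum();
    PVOID p = VirtualAlloc(NULL, 64ULL * 1024 * 1024 * 1024,
                           MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES,
                           PAGE_READWRITE);
    printf("large page = %zu bytes, buffer at %p\n", largePage, p);
    return p ? 0 : 1;
}
```

Committing 64 GB of large pages requires that much locked physical memory to be available at allocation time, so this is only practical on a dedicated box like the OP's embedded system.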

Peter:

When your chunks are reserved, you can call MmMapLockedPagesWithReservedMapping

Unfortunately, it cannot map to a random offset within the reserved range.

Thank you, Mr. Grig, for the correction. It’s certainly appreciated, and I’m sure it’ll be appreciated by the OP even more ;-)

Windows is generally loath to let you combine buffers into one big virtual address space, except for the use of chained MDLs in the network stack… But I *swear* there was a way to do this using the Reserved Mapping feature… Grab a huge chunk of KVA space and then map buffers to it a piece at a time. Gad, what WAS the way to do this?

Now the OP has ME wondering.

Peter
OSR
@OSRDrivers

>Grab a huge chunk of KVA space and then map buffers to it a piece at a time.

I wonder if you reserve the space with VirtualAlloc (without commit), will it let you do MmMapLockedPagesSpecifyCache/UserMode?

Alex: Yes, I am using ZwAllocateVirtualMemory without commit, but I have to free the allocated memory again before using MmMapLockedPagesSpecifyCache/UserMode; otherwise it will not work. That leaves me with a free area that I can use. The problem is that I need to be sure that nothing else allocates in the free area I have found. ZwAllocateVirtualMemory requires PASSIVE_LEVEL, so I cannot raise IRQL, and I am not sure how to prevent other APC/PASSIVE-level threads from running while I am mapping.

Peter: As Alex says, with MmAllocateMappingAddress it is only possible to map using the address returned. You cannot map at an offset within the buffer. I have been there.

Is it possible? Everything is possible, Mr. Kuhre. We’re only talking about
hardware, software, time and money after all. Usually, I’ve found that if
something is inordinately difficult it’s because I’m doing something wrong.

You are probably right, but my driver is far from a normal driver.

>The problem is that I need to be sure that no other process is allocating in the free area I have found.

You’re doing it the wrong way around. You need to do that in the context of a dispatch routine for a DeviceIoControl IRP. In that case, you’re guaranteed to be in the context of the target process for as long as you need. It’s up to the client process to make sure its other threads do not allocate virtual address space in the meantime. The best time for this is during process startup, when there are no other threads.
Once you have grabbed the memory, freeing and reallocating it would be unwise. Just hold on to it.

But simply doing VirtualAlloc for large pages (2MB each) seems to be the best course of action.
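Alex's sequencing might be sketched like this on the application side. The IOCTL code and request structure are made-up illustrations; whether the driver can map into a range the app still holds reserved, or must have it released first, is exactly the open question discussed above.

```c
#include <windows.h>
#include <winioctl.h>

/* Hypothetical IOCTL and request layout -- not the OP's actual interface. */
#define IOCTL_MAP_CAPTURE_BUFFER \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

typedef struct _MAP_REQUEST {
    PVOID     BaseAddress;   /* where the driver should map the chunks */
    ULONGLONG TotalSize;
} MAP_REQUEST;

BOOL SetupCaptureBuffer(HANDLE hDevice, ULONGLONG totalSize)
{
    /* Reserve (do not commit) during single-threaded startup, so no other
     * thread in this process can take the range. */
    PVOID base = VirtualAlloc(NULL, totalSize, MEM_RESERVE, PAGE_NOACCESS);
    if (base == NULL)
        return FALSE;

    /* The driver's dispatch routine for this IOCTL runs in this process's
     * context, so it can build the user-mode mappings against `base`. */
    MAP_REQUEST req = { base, totalSize };
    DWORD bytes;
    return DeviceIoControl(hDevice, IOCTL_MAP_CAPTURE_BUFFER,
                           &req, sizeof(req), NULL, 0, &bytes, NULL);
}
```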

Am I the only one who finds the above description (as you may have guessed, taken from the OP’s company website) at least strange in absolutely all respects - from the very concept of implementing the functionality of an NDIS IM filter in hardware to delivering packets “under all conditions” (apparently, including even network failure)?

Anton Bassov

You may be reading more into “under all conditions” than they intended :-)

Mark Roddy

On Mon, Feb 24, 2014 at 5:12 PM, wrote:

> Am I the only one who finds the above description (as you may have guessed,
> taken from the OP’s company website) at least strange in absolutely all
> respects…

> You may be reading more into “under all conditions” than they intended :-)

Please note that the OP says that he “has to be able to capture 40 Gb/s without any packet loss”, and the docs on their website emphasize that their adapter is able to work without packet loss under any circumstances.
Needless to say, this requirement is simply unrealistic. For example, consider what happens if the system is low on RAM, say, because of heavy network traffic. It has no option other than to start dropping packets, right? Yet the OP maintains that he cannot afford to lose packets…

Anton Bassov

40 Gbps full duplex is 10 GB/s, so 64 GB will only hold some 6 seconds of traffic. That’s OK, if there is hardware to filter the packets and detect the trigger.

Let me put this as nicely as possible:

Can we not resort to reviewing or criticizing the description of people’s products from their websites, please? I seriously doubt ANYbody would come to NTDEV wanting, or expecting, to have to defend their product this way.

Such criticism isn’t helpful, or even relevant, to the discussion here. If you want to scold the OP for something in his post, that’s fine. Heck, I did that. If you want to beat him over his marketing department’s choice of words or his product’s advertised goals… find someplace else to do it.

Peter
OSR
@OSRDrivers

Peter,

Actually, it has absolutely nothing to do with “his marketing department’s choice of words”…

My point is that the OP may simply be trying to comply with a request that is unreasonable in itself, and, as a result, tries to do things that may be simply ridiculous from a technical standpoint. In your terminology, he may have been asked to stick wings on a pig and has no chance to explain to his boss that pigs don’t fly (you wrote a good article about it a few years ago in NT Insider). Some years ago I worked with a client like that, so I know what it is like…

Anton Bassov

A strange way this discussion has gone. I thought this forum was for helping each other.

> I do not need people to tell me that what I am trying to do is nuts. I need
> people to tell me whether or not it is possible to do what I want to do.

There is a reason for my comment in the beginning. Looking around the forum, a lot of people ask questions, and a lot of people answer the questions with comments like “it is nuts”, “ridiculous”, etc. Please, if you don’t know how to help, then don’t write an answer. Fortunately a lot of people want to help. I appreciate that.

Anton: I am not sure what you are trying to say.

Peter: You are absolutely right.

Alex: I am already doing it in the context of an IRP/process. I had some bugs in my code, and it seems to work now. I think the way I am doing it is the only way, besides your idea about using VirtualAlloc. Thank you for your help.