@Tim_Roberts said:
When you call IoAllocateMdl, are you passing the user-mode virtual address? How are you getting the address in kernel mode? I assume you must have done this in the user process context, otherwise it wouldn’t work at all. Are you quite sure you are mapping the entire buffer? The operating system won’t change virtual-to-physical mappings of a locked buffer, so there must be something strange going on.
Thanks @Tim_Roberts & @Peter_Viscarola_(OSR) for your replies!
Tim, below are more direct answers to your questions; after this I will write a fuller description to provide the context Peter suggested:
- Yes, we do pass the user-mode virtual address when calling IoAllocateMdl in kernel mode
- Not sure whether your second question referred to the physical (or logical) address or the virtual address. If it's the latter, I'll explain in the description below; for logical addresses, we use PMAP_TRANSFER. These logical addresses are read by the card via DMA in batches from memory and used as destination addresses for DMA writes performed by a DMA engine on the card. This part is confirmed to be working and is not affected by the bug we are targeting here
- Yes, we allocate the buffers in user mode and pass the base virtual address and the buffer sizes to a DeviceIoControl call, in which we do IoAllocateMdl and MmProbeAndLockPages
- No, we are not mapping the entire buffer in one shot, but we do map the entire buffer. The circular buffer can be 512MB or 1GB in size, but depending on the size of system memory, we allocate it in user mode as multiple 132MB segments. Each segment's base virtual address is passed to our driver function to build the MDLs and lock the pages in kernel mode. The driver function uses IoAllocateMdl to map and lock in 32MB blocks until the whole segment is mapped and locked. The MDLs, as well as the logical addresses, are passed back to the app in user mode. We do this for every segment until the whole circular buffer is done. Note that the logical addresses passed back are pointers to 4KB physical pages, from which we build even finer-grained 1KB DMA descriptors in user space, for the bus-master DMA engine on the card to use later for 1KB DMA writes.
- For your question earlier about the NUMA node: initially (call it version 0) we didn't use CreateFileMappingNumaA; we used CreateFileMappingA. We changed to CreateFileMappingNumaA to remove one variable, but it still failed after the change. Your question got us looking a little deeper into this, and we realized it may be more complicated to properly map the memory to a NUMA node: do we need to find out which NUMA node the current app is running on and pin the app there first, then map the memory to the corresponding node? If you could provide some guidance on how to do this properly, that would be great.
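For concreteness, the per-segment lock loop in our driver looks roughly like the sketch below (this is illustrative, not our literal code; `LockSegment` and the error handling are simplified, and it must run in the context of the allocating process):

```c
#include <wdm.h>

#define LOCK_BLOCK_BYTES (32 * 1024 * 1024)  /* 32MB lock granularity */

/* Sketch: build one MDL per 32MB block of a user-mode segment and
   lock the pages for DMA writes by the device. */
NTSTATUS LockSegment(PVOID UserBase, SIZE_T SegLen, PMDL *Mdls, ULONG *Count)
{
    ULONG i = 0;
    for (SIZE_T off = 0; off < SegLen; off += LOCK_BLOCK_BYTES, i++) {
        SIZE_T remain = SegLen - off;
        ULONG  len = (remain < LOCK_BLOCK_BYTES) ? (ULONG)remain
                                                 : LOCK_BLOCK_BYTES;
        PMDL mdl = IoAllocateMdl((PUCHAR)UserBase + off, len,
                                 FALSE, FALSE, NULL);
        if (!mdl)
            return STATUS_INSUFFICIENT_RESOURCES;
        __try {
            /* Device DMA-writes into this memory, so request write access. */
            MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(mdl);
            return GetExceptionCode();
        }
        Mdls[i] = mdl;
    }
    *Count = i;
    return STATUS_SUCCESS;
}
```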
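On the NUMA point, the flow we're considering (just a guess on our part, not tested; happy to be corrected) would be to query the calling thread's current node and pass it as the preferred node:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Which node is this thread on right now? Without pinning the thread
       (e.g. SetThreadAffinityMask) it could migrate afterwards, which is
       exactly the part we're unsure about. */
    PROCESSOR_NUMBER pn;
    USHORT node = 0;
    GetCurrentProcessorNumberEx(&pn);
    GetNumaProcessorNodeEx(&pn, &node);

    /* One 132MB segment of the circular buffer, preferred on that node. */
    HANDLE h = CreateFileMappingNumaA(
        INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
        0, 132 * 1024 * 1024, NULL, node);
    if (!h) {
        printf("CreateFileMappingNumaA failed: %lu\n", GetLastError());
        return 1;
    }
    void *base = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    printf("node=%u base=%p\n", node, base);
    return 0;
}
```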