How to prevent physical to virtual address mapping from changing?

Hi there,

I found this community forum through an article posted on OSR Online, about sharing memory between drivers and apps.

I was debugging a problem about sharing memory between a driver and multiple apps through a mechanism described in detail in my post on Stack Overflow.

The mechanism worked for years on Windows Server 2008 and earlier. Now we are planning to upgrade to Server 2016 but are blocked by this issue.

Could anyone shed some light on this? Thank you!

Best regards,
Hua

It is a bit tacky to require us to look at another page to get the problem description.

The comment in the documentation that “the virtual address might be unmapped, reused, etc” does NOT mean that such things might happen spontaneously. What it means is that, after a driver has called MmProbeAndLockPages, it’s possible for the process to exit, or for the application to free the memory pages and allocate something new at that same address. If the process has not exited and the application has not rejiggered its memory allocations, then there is NO WAY that the physical-to-virtual mapping can change. The system will not do that on its own.

Have you ensured that all of your accesses are on the same NUMA node?

@Tim_Roberts Sorry, you are right, I am posting the details below and will answer your questions next:

We are debugging a problem that seems to happen only after we upgraded the OS from Windows Server 2008 to Windows Server 2016 (not sure if this is relevant).

We have a series of memory pages (hundreds of MB) allocated as shared memory data buffers (organized into a circular buffer), and we allow a peripheral card to DMA data into these pages. These memory pages are first allocated in user space using CreateFileMappingNumaA, then an MDL is established with IoAllocateMdl, then the pages are locked using MmProbeAndLockPages. The physical addresses of these pages are provided to the peripheral card as DMA descriptors to perform DMA writes to memory, and an app then reads the data from these buffers through the corresponding virtual addresses.
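
For reference, here is a stripped-down sketch of the user-mode side of this flow. The IOCTL code, the device name, and the input structure below are placeholders for illustration, not our real code:

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

#define IOCTL_LOCK_BUFFER  CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
#define SEGMENT_SIZE       (132ull * 1024 * 1024)   /* one 132 MB segment */

int main(void)
{
    /* Pagefile-backed named section, so other apps can open the same memory. */
    HANDLE hSection = CreateFileMappingNumaA(
        INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
        (DWORD)(SEGMENT_SIZE >> 32), (DWORD)(SEGMENT_SIZE & 0xFFFFFFFFu),
        "Local\\CircularBufferSegment0", NUMA_NO_PREFERRED_NODE);
    if (hSection == NULL) return 1;

    void *base = MapViewOfFile(hSection, FILE_MAP_ALL_ACCESS, 0, 0, (SIZE_T)SEGMENT_SIZE);
    if (base == NULL) return 1;

    HANDLE hDevice = CreateFileA("\\\\.\\MyDmaDevice", GENERIC_READ | GENERIC_WRITE,
                                 0, NULL, OPEN_EXISTING, 0, NULL);
    if (hDevice == INVALID_HANDLE_VALUE) return 1;

    /* The driver receives the user VA, builds the MDLs, calls MmProbeAndLockPages,
       and hands back the per-page addresses (output buffer omitted in this sketch). */
    struct { void *UserVa; ULONGLONG Length; } in = { base, SEGMENT_SIZE };
    DWORD bytesReturned = 0;
    BOOL ok = DeviceIoControl(hDevice, IOCTL_LOCK_BUFFER,
                              &in, sizeof(in), NULL, 0, &bytesReturned, NULL);
    printf("lock request %s\n", ok ? "succeeded" : "failed");
    return 0;
}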

The mapping between virtual and physical addresses for a page is set up by the OS at allocation time.

MmProbeAndLockPages then probes those memory pages and asks the OS to lock them (so they won’t be paged out to secondary storage). According to the description of this function, it ensures the physical memory is locked but doesn’t guarantee that the virtual addresses won’t get reused or unmapped.

With Server 2008, we can run this for weeks with no problem. But after recently updating to Server 2016, this mechanism seems to work for only up to 5 hours before the data read from a certain virtual address becomes outdated and unchanging, as if DMA has stopped writing into this page, even though the pages before and after it in the circular buffer still get updated by continuous DMA traffic as they should.

By comparing the virtual-to-physical address mapping right after initialization and again after the problem happens, we have confirmed that the same virtual address is mapped to a different physical address when the problem occurs. So the DMA engine is most likely still writing to the old physical address, while the app is reading through a virtual address that is now mapped to a different physical page.

What I don’t know is: what could trigger the remapping of a virtual address to a different physical address when the physical pages are locked against paging to secondary storage? And how can we prevent this from happening?

Is it possible that MmProbeAndLockPages only locks the physical pages in main memory, but doesn’t lock the mapping between virtual and physical addresses, so that the OS can still change the mapping without the app knowing it (for example, swapping the page out to a pagefile and then back into main memory at a different physical address, with the content copied to the new address too… but the content is outdated because the DMA engine in the peripheral card is still writing into the old physical address for this page)? If that’s the case, is it possible to lock the mapping after initialization so the OS won’t do that?

Not sure if I described it clearly so please feel free to ask questions. Also, if there are better tags to use for this topic, please let me know.

Your advice is very much appreciated!

Edit:

MmMapLockedPagesSpecifyCache seems to be a function that maps locked physical pages to a virtual address explicitly. Is this the way to do it, or is there a more elegant way to do this properly?

@Tim_Roberts We are pretty sure the process has not exited before or when the problem happens, because only a few pages stopped getting updated by DMA writes from the card; most other pages continued to be updated after we detected the problem.

We will double-check whether the app rejiggered the memory allocation, but we’re pretty sure it didn’t. All pages have a fixed virtual address in this app’s memory space, and it never changed. We examined the memory content at that virtual address: it was a packet that had been received earlier (with a timestamp in the packet) and should have been overwritten by newer DMA writes from the card, but it wasn’t. It didn’t appear to be corrupted either.

If there were a bug rejiggering the memory, shouldn’t the virtual address have changed? Unless it was rejiggered in a very specific way, but we are not aware of any code that does that… we will check.

So the same virtual memory address (VAddr) points to a page with outdated content. The next thing we checked was the physical memory address, and we found that it had changed: if VAddr was mapped to PhyAddr1 after initialization, 5 hours later it was mapped to PhyAddr2.

Then we checked the content at PhyAddr1 and found it was being updated with newer packets, which means the DMA writes from the card were still happening as expected. It’s just that the app can now only access PhyAddr2 through VAddr.

Are you sure the OS won’t do anything to remap VAddr from PhyAddr1 to PhyAddr2, other than the process exiting and restarting, or the app doing it itself? Do you see any flaws in our memory management mechanism that could allow the OS to remap?

I will double check the code on your question about accessing on the same NUMA node and get back to you.

Thanks!

When you call IoAllocateMdl, are you passing the user-mode virtual address? How are you getting the address in kernel mode? I assume you must have done this in the user process context, otherwise it wouldn’t work at all. Are you quite sure you are mapping the entire buffer? The operating system won’t change virtual-to-physical mappings of a locked buffer, so there must be something strange going on.

Hmmmm… There’s a lot to not like in your problem description.

The biggest thing is this:

then the pages are locked using MmProbeAndLockPages. The physical addresses of these pages are provided to the peripheral card as DMA descriptors to perform DMA writes to memory

Hmmm… You can’t use the “physical address” of those pages to do DMA. Doing so is not only a violation of the Windows OS architecture, it also won’t work on (the increasing number of) systems that implement DMA Remapping (I/O MMU).

Also… CreateFileMappingNumaA (or any “create file mapping”) is absolutely the wrong way to implement a block of memory that’s shared between kernel-mode and user-mode in Windows. I have a scoop for you, in case you didn’t notice: Windows is not Linux. In Windows devices are not files, and you can’t mmap them.

Finally, as Mr. Roberts correctly noted, Windows will not arbitrarily change user-mode virtual addresses of mapped buffers. I mean… THINK about it for a minute: HOW could that work? Let’s say you VirtualAlloc a block of memory, and you read some data into it asynchronously. And when the read is complete, the data is in the buffer… but the user virtual address of the buffer has changed. How would an app ever know this?

So… like many folks… you are searching for an answer to a very specific question… but you’re asking the wrong question.

I know, that’s probably not the answer you want to hear. But, that’s the story.

Peter

@Tim_Roberts said:
When you call IoAllocateMdl, are you passing the user-mode virtual address? How are you getting the address in kernel mode? I assume you must have done this in the user process context, otherwise it wouldn’t work at all. Are you quite sure you are mapping the entire buffer? The operating system won’t change virtual-to-physical mappings of a locked buffer, so there must be something strange going on.

Thanks @Tim_Roberts & @Peter_Viscarola_(OSR) for your replies!

Tim, below are (more) direct answers to your questions. I will write a better description after this to provide better context as suggested by Peter:

  1. Yes, we do pass the user-mode virtual address when calling IoAllocateMdl in kernel mode.
  2. Not sure if you were referring to the physical (or logical) address or the virtual address in your 2nd question. If it’s the latter, I will explain in the description below; for logical addresses, we use PMAP_TRANSFER. These logical addresses are read by the card from memory in batches and used as destination addresses for DMA writes conducted by a DMA engine on the card. This part is confirmed to be working and is not impacted by the bug we are targeting here.
  3. Yes, we allocate the buffers in user mode and pass the base virtual address and the buffer sizes to a DeviceIoControl, in which we do IoAllocateMdl and MmProbeAndLockPages.
  4. No, we are not mapping the entire buffer in one shot, but we do map the entire buffer. The circular buffer could be 512MB or 1GB in size; depending on how big the system memory is, we allocate the circular buffer in multiple 132MB segments in user mode. Each segment’s base virtual address is passed to our driver function to establish the MDLs and lock the pages in kernel mode. The driver function uses IoAllocateMdl to map and lock in 32MB blocks until the whole segment is mapped and locked (a rough sketch follows this list). The MDLs, as well as the logical addresses, are passed back to the app in user mode. We do this for every segment until the whole circular buffer is done. Note that the logical addresses passed back are pointers to 4KB physical pages, based on which we create even finer-grained 1KB DMA descriptors in user space, for the Bus-Master DMA engine in the card to use later for 1KB DMA writes.
  5. For your earlier question about NUMA nodes: initially (let’s say version 0) we didn’t use CreateFileMappingNumaA; instead, we used CreateFileMappingA. We changed to CreateFileMappingNumaA to remove one variable, but it still failed after the change. Your question got us looking a little deeper into this, and we realized it may be more complicated to properly map the memory to a NUMA node… do we need to know which NUMA node the current app is on and pin the app to it first, then map the memory to the corresponding NUMA node? If you could provide some guidance on how to do this properly, that would be great.
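
To make item 4 concrete, here is roughly what the driver-side loop looks like. LockSegment, SEGMENT_CONTEXT, and BLOCK_SIZE are invented names for illustration; this is a sketch of the pattern, not our actual driver code (cleanup of already-locked blocks on failure is omitted):

#include <ntddk.h>

#define BLOCK_SIZE (32ul * 1024 * 1024)   /* lock the segment in 32 MB blocks */

typedef struct _SEGMENT_CONTEXT {
    PMDL  Mdls[8];        /* enough MDLs for one 132 MB segment */
    ULONG MdlCount;
} SEGMENT_CONTEXT, *PSEGMENT_CONTEXT;

/* Must be called in the context of the process that owns UserVa, i.e. in the
   DeviceIoControl path of the app that created the mapping. */
NTSTATUS
LockSegment(
    _In_ PVOID UserVa,            /* base VA of one 132 MB segment */
    _In_ SIZE_T SegmentLength,
    _Inout_ PSEGMENT_CONTEXT Ctx)
{
    PUCHAR va = (PUCHAR)UserVa;
    SIZE_T remaining = SegmentLength;
    ULONG  thisBlock;
    PMDL   mdl;

    Ctx->MdlCount = 0;

    while (remaining != 0) {
        thisBlock = (remaining > BLOCK_SIZE) ? BLOCK_SIZE : (ULONG)remaining;

        if (Ctx->MdlCount >= RTL_NUMBER_OF(Ctx->Mdls)) {
            return STATUS_BUFFER_TOO_SMALL;
        }

        mdl = IoAllocateMdl(va, thisBlock, FALSE, FALSE, NULL);
        if (mdl == NULL) {
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        __try {
            /* Probe for write access, since the device DMAs into these pages. */
            MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
        }
        __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(mdl);
            return GetExceptionCode();
        }

        Ctx->Mdls[Ctx->MdlCount++] = mdl;

        /* The PFN array behind each MDL is what the per-page table is built
           from; the device bus logical addresses still come from the DMA
           adapter (MapTransfer), not from raw physical addresses. */

        va += thisBlock;
        remaining -= thisBlock;
    }

    return STATUS_SUCCESS;
}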

Okay, below is another attempt to describe the bigger picture (why we would like to glue wings to our pig :wink: @“Peter_Viscarola_(OSR)”; the previous description did a poor job):

  1. We have an interface card getting tons of packets from the line and we need to pass those packets to a suite of apps to process. These packets have a max size of 1KB
  2. The interface card has a Bus-Master DMA engine that reads descriptors from main memory to get the destination (logical) address to do DMA writes to main memory
  3. Between the suite of apps and the interface card, we have an app (say app1) just a layer above the driver that loads and initializes the drivers. Part of the initialization includes allocating memory, locking it in memory, creating a list of descriptors (one per DMA transfer/one per packet), and letting the card know where to get them
  4. Once initialized, app1 can share the memory buffer w/ other apps so all apps (let’s say one of them is app2) can access all packets
  5. This memory buffer is basically a large circular buffer made up of a series of 1KB packet buffers; let’s call it the circular buffer from now on. The circular buffer is allocated using CreateFileMappingA as shared memory (between apps) in 132MB segments in user mode, then mapped and locked in kernel mode in 32MB blocks. A table is then created in kernel mode describing each segment in 4KB pages, recording both logical and virtual addresses; that table is returned to user space, and based on it another table (the packet buffer table) is created describing the circular buffer in 1KB packet buffers.
  6. The packet buffer table is also allocated using CreateFileMappingA as shared memory. With this table shared between apps, all an app needs in order to access a specific packet buffer in the circular buffer is its index, which is an offset from the base address of the circular buffer (a rough sketch of these tables follows this list).
  7. The mechanism above works well on Windows Server 2008, but in the test setup we created for Windows Server 2016, we found that we could miss a few packets once in up to 5 hours, for whatever reason. For the same hardware, same driver, same app1 and app2, running a different OS, the result is different.
  8. Every time it fails, we notice that 4 packets in a row fail, and they happen to reside on the same physical page (the question was titled as such largely because of this correlation, but @“Peter_Viscarola_(OSR)” was right that correlation does not determine causation).
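
To make items 5 and 6 a little more concrete, here is a rough sketch of the two tables as described above. The field names are invented for illustration; our real layout differs:

#include <stdint.h>

/* Kernel-built table: one entry per 4 KB page of a segment (item 5). */
typedef struct PageEntry {
    uint64_t DeviceLogicalAddress;  /* logical address obtained via the DMA adapter */
    void    *UserVirtualAddress;    /* where app1 sees this 4 KB page */
} PageEntry;

/* User-built table: one entry per 1 KB packet buffer, itself shared between
   apps through its own CreateFileMappingA section (item 6). */
typedef struct PacketEntry {
    uint64_t DeviceLogicalAddress;  /* destination used in the card's DMA descriptor */
    uint32_t OffsetFromBase;        /* index * 1024: offset into the circular buffer */
    uint32_t Flags;                 /* bookkeeping, e.g. "written at least once" */
} PacketEntry;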

To be more specific, our testing is done in this way:

  1. Start app2, which starts app1 to initialize everything
  2. Then start a process in app2 to allow packets to flow in; after about 10 minutes, all packet buffers in the circular buffer have been written at least once
  3. After that, new packets start overwriting old packets; app1 tracks the packets and increments the packet index to reflect the progress of overwriting
  4. App2 uses the updated index to check whether the packets have been overwritten as expected (new packets have different signatures from old ones; see the sketch after this list)
  5. If a packet is found to be outdated (the index has grown past that packet’s position but it still has the old signature), the whole process stops (though packets usually keep flowing in for a bit)
  6. If no packets are found outdated after about 20 minutes (all packets in the circular buffer have been overwritten at least once by now), we halt this test iteration and stop app2, which closes app1 gracefully before closing itself
  7. Go to 1 again to start a new iteration
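
In code, the check in steps 4 and 5 amounts to something like the sketch below. CheckOverwrittenPackets and the assumption that the signature sits at offset 0 are inventions for illustration only:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Returns false as soon as a packet that should have been overwritten still
   carries the old signature -- the "outdated packet" failure in step 5. */
bool CheckOverwrittenPackets(const uint8_t *circularBase,
                             uint32_t packetSize,        /* 1 KB in our case */
                             uint32_t packetCount,       /* packets in the ring */
                             uint64_t lastCheckedIndex,
                             uint64_t currentIndex,      /* maintained by app1 */
                             uint32_t newSignature)
{
    for (uint64_t i = lastCheckedIndex; i < currentIndex; i++) {
        const uint8_t *pkt = circularBase + (i % packetCount) * (uint64_t)packetSize;
        uint32_t sig;
        memcpy(&sig, pkt, sizeof(sig));   /* assume the signature sits at offset 0 */
        if (sig != newSignature) {
            return false;                 /* outdated packet: stop the whole test */
        }
    }
    return true;
}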

It takes 2-5 hours of repeating the test to find an outdated-packet occurrence, and every time it happens, there are 4 packets in a row.

Note: if we don’t do step 6, which triggers the sequence of restarting app1 and app2, the test may run forever without reproducing the problem.

Hopefully this at least provides enough context to allow more questions to be asked to fill in the gaps. I am grateful that @Tim_Roberts & @“Peter_Viscarola_(OSR)” even bothered to reply before this.

OK… so, we’ve established you’re not using physical addresses for DMA. You’re calling MapTransfer and thus using Device Bus Logical Addresses. We can all sleep tonight. Good.

What led you to use CreateFileMapping for this task? This strikes me as a poor choice, because pages mapped this way are intended to be managed by the Windows Cache Manager. I think I’ve seen it done before, and I guess it should work, but it makes me uncomfortable compared to other, more straightforward approaches. I can’t help but wonder if you’re not hitting some weird edge condition related to this.

Peter

What led you to use CreateFileMapping for this task? This strikes me as a poor choice, because pages mapped this way are intended to be managed by the Windows Cache Manager. I think I’ve seen it done before, and I guess it should work, but it makes me uncomfortable compared to other, more straightforward approaches. I can’t help but wonder if you’re not hitting some weird edge condition related to this.

@“Peter_Viscarola_(OSR)” this is actually legacy code. The design decision was probably made a decade ago. Since it always worked, we haven’t dug into it until now.

I am speculating here, but CreateFileMapping is perhaps the recommended way to create named shared memory between apps, and we do want to share the circular buffer between apps, not just between one app and the driver.

Another requirement is the size of the circular buffer, which could be 1GB.

Do you know of a common way to handle both requirements? Your advice is much appreciated!

What led you to use CreateFileMapping for this task? This strikes me as a poor choice,

Something tells me that this may well be the root of the problem. To be honest, I just don’t see any reason why the Memory Manager would want to change the virtual-to-physical mapping if the target physical page is locked in RAM. I just wonder if the section may be the “culprit” here. Perhaps MM treats mapped pages a bit differently from anonymous ones in this respect, taking into consideration that sections are meant to be shared across process boundaries, as well as used by the Cache Manager.

Anton Bassov

@anton_bassov Do you know a common way to allocate a large memory buffer to be shared between device, driver, and multiple apps?

Or do we have to receive data into a memory buffer shared between the device, the driver, and app1, and then copy it into another memory buffer shared among multiple apps?

Do you know a common way to allocate a large memory buffer to be shared between device, driver, and multiple apps?

Don’t you see any potential security-related issues with this approach??? Not to mention the fact that access to this buffer has to be synchronised somehow.

In any case, notwithstanding the above, MmMapLockedPagesSpecifyCache() allows you to map an MDL into the userland part of the address space… It goes without saying that this call has to be made in the context of the target process.
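
Just to illustrate the pattern, a minimal sketch (assuming the MDL has already been built and locked, and that the call is made in the context of the target process, e.g. in a METHOD_BUFFERED IOCTL handler) might look like this:

#include <ntddk.h>

NTSTATUS
MapMdlIntoCaller(
    _In_ PMDL LockedMdl,        /* built earlier with IoAllocateMdl +
                                   MmProbeAndLockPages (or MmAllocatePagesForMdlEx) */
    _Out_ PVOID *UserVa)
{
    *UserVa = NULL;

    __try {
        /* Let the Memory Manager pick the user address: RequestedAddress is
           NULL, AccessMode is UserMode, cached mapping for normal RAM. */
        *UserVa = MmMapLockedPagesSpecifyCache(LockedMdl,
                                               UserMode,
                                               MmCached,
                                               NULL,     /* no requested address */
                                               FALSE,    /* no bugcheck on failure */
                                               NormalPagePriority);
    }
    __except (EXCEPTION_EXECUTE_HANDLER) {
        /* For UserMode mappings a failure is reported by raising an exception. */
        return GetExceptionCode();
    }

    if (*UserVa == NULL) {
        /* Defensive check only; UserMode failures normally raise instead. */
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    /* The mapping must later be torn down with
       MmUnmapLockedPages(*UserVa, LockedMdl) in the context of the same process. */
    return STATUS_SUCCESS;
}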

Anton Bassov

Yes, the access is synchronized… only the device can perform DMA writes to these buffers, in a particular sequence defined by descriptor lists that were initialized by app1. All the other apps can only read from these buffers.

The DMA engine in the device also maintains a DMA counter, incremented on every DMA write, which app1 uses to track how far the circular buffer has been written. For the packets written, app1 briefly checks each packet and lets the other apps know how many new packets are available for them (not all packets go to all other apps). The other apps only have read access to these buffers (which could be concurrent, though) and only access them after being notified by app1.

We actually tried the MmMapLockedPagesSpecifyCache() approach in this process and it failed… I am sure I have questions about how to use it properly, but let me digest a bit.

Thanks!

@anton_bassov

When we tried MmMapLockedPagesSpecifyCache(), we were still using CreateFileMapping to allocate named shared memory in app1’s user mode, although the memory was locked in kernel mode before calling MmMapLockedPagesSpecifyCache. We got an exception when calling it…

Do you think the following would work?

  1. App1 reserves enough memory in user mode by using VirtualAlloc
  2. By using DeviceIoControl, app1 requests to, in kernel mode, allocate memory buffers in small chunks, establish MDLs, lock them, and map each MDL to the relevant virtual addresses by using MmMapLockedPagesSpecifyCache
  3. App2 gets the memory size from app1 to reserve in the same way (VirtualAlloc)
  4. App2 also calls DeviceIoControl into kernel mode, but only to map the existing MDLs to app2’s virtual addresses using MmMapLockedPagesSpecifyCache
  5. App1 & app2 still only have read access to these memory buffers; app2 is still synchronized by app1 when accessing them
  6. Other apps would behave similarly to app2

A few more questions:

  1. Could the same memory buffer be mapped into multiple processes’ user space?
  2. Should we specify this memory as noncached or cached?

Thank you!

Do you think the following would work?

App1 reserves enough memory in user mode by using VirtualAlloc
By using DeviceIoControl, app1 requests to, in kernel mode, allocate memory buffers in small chunks, establish MDLs, lock them, and map each MDL to the relevant virtual addresses by using MmMapLockedPagesSpecifyCache
App2 gets the memory size from app1 to reserve in the same way (VirtualAlloc)
App2 also calls DeviceIoControl into kernel mode, but only to map the existing MDLs to app2’s virtual addresses using MmMapLockedPagesSpecifyCache
App1 & app2 still only have read access to these memory buffers; app2 is still synchronized by app1 when accessing them
Other apps would behave similarly to app2

Well, under normal circumstances (i.e. if we were speaking about a tightly-coupled app-driver pair) I would rather suggest allocating a buffer in userland. However, in your particular case (i.e. the target buffer is shared by multiple apps), this approach would imply some extra things to worry about and issues to deal with. For example, consider what happens if the process that actually allocated the memory terminates abnormally while some other apps still need the target buffer. Therefore, in this particular case your decision to allocate memory in the driver seems justified.

However, mapping an MDL to a userland address that has already been reserved by VirtualAlloc() may be rather problematic.

Could the same memory buffer be mapped into multiple processes’ user space?

Assuming that we speak about an MDL that describes the locked pages, why not?

Should we specify this memory as noncached or cached?

Unless we are speaking about something very specific (like, for example, memory-mapped device BARs), you should always specify the cached memory type. Furthermore, if the caching type that you specify conflicts with that of some already existing mapping, this parameter is, IIRC, going to be modified by MmMapLockedPagesSpecifyCache() behind the scenes anyway…

Anton Bassov

Thanks Anton!

@anton_bassov said:
However, mapping an MDL to a userland address that has already been reserved by VirtualAlloc() may be rather problematic.

Would you please elaborate on that? Is it not how MmMapLockedPagesSpecifyCache is expected to be used?

Could the same memory buffer be mapped into multiple processes’ user space?
Assuming that we speak about an MDL that describes the locked pages, why not?

Yes, we are talking about locked physical pages mapped for read-only access from different apps in their own virtual address spaces… we can try this basic idea pretty quickly.

Should we specify this memory as noncached or cached?
Unless we are speaking about something very specific (like, for example, memory-mapped device BARs), you should always specify the cached memory type. Furthermore, if the caching type that you specify conflicts with that of some already existing mapping,

Thanks for confirming this… but I am not sure whether enabling caching would make this approach vulnerable to the same issue we have with our current approach. At the same time, the performance of noncached memory is painful too.

this parameter is, IIRC, going to be modified by MmMapLockedPagesSpecifyCache() behind the scenes anyway…

Will need to digest a bit before I can tell if I understand this

Would you please elaborate on that? Is it not how MmMapLockedPagesSpecifyCache is expected to be used?

IIRC, there was a thread where a poster was trying to do exactly this kind of thing and asking why he was always getting an error. Someone (I think it was Mr. Noone) pointed out to him that a call to MmMapLockedPagesSpecifyCache() on an existing userland address was bound to fail.

These days the WRK is publicly available, so you can always check the sources. In order to avoid putting my foot in my mouth yet another time, I decided to do just that before typing this post. Check MmMapLockedPagesSpecifyCache(), and you will see a self-explanatory sequence of MiMapLockedPagesInUserSpace() → MiCheckForConflictingVadExistence() → MiCheckForConflictingNode() calls behind the scenes.

I am not sure whether enabling caching would make this approach vulnerable to the same issue we have with our current approach.

How could your current problem possibly be related to the caching type specified in the PTE???

Anton Bassov

By total happenstance, I have been looking at file mapping for an unrelated project. In my case, there is UM-to-UM communication via the loopback adapter that might be optimized by using shared memory instead, but for this application I have some questions:

  1. what makes app1 special that it should arbitrate which ‘packets’ app2, app3 etc. should see?
  2. assuming that you could make a coherent view in memory from your device (the writer) and these various processes, how do you expect to synch access to the data?

If the UM processes have read-only views, there is no way for them to confirm when they have read or copied any part of the data. So unless it is single bytes, or some kind of telemetry that can go ‘in and out’ of sync without serious harm (audio and video streams are also data that can take this kind of loss, but records of financial transactions (my industry) are not), you cannot do this without some kind of sync.

Then add the complexity of one UM app arbitrating what the other UM apps can ‘see’. This cannot be done efficiently in a single shared memory region, notwithstanding the read-only problem above. It can of course be done, but what data rate are you targeting? Maybe I missed that, but unless it is at least 1 Gb/s sustained, this is all way too complicated.

Without any direct knowledge of your specific use case, I have been thinking about this problem. In my case, statistics show that a heavily loaded system sees about 9 ms of latency between processes sending TCP data via the loopback adapter.

Coding the writer into shared memory seems trivial when there is a single writer. But coding the reader (whether one or many) seems much harder. If the data rate is consistent, then a sleep loop, wasteful as it might be, is an easy way to do this. But if the data rate is variable, it is much harder. Long periods of nothing to do, and then brief periods of more work than you can possibly handle
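
For what it’s worth, the sleep-loop reader I have in mind is no more complicated than the sketch below (names invented, single writer assumed, and it knowingly accepts that a lapped reader loses data):

#include <windows.h>

/* Layout of the shared region header; names are invented for illustration. */
typedef struct SharedHeader {
    volatile LONG64 WriteIndex;   /* advanced by the single writer after each record */
    /* ... followed by the ring of fixed-size records ... */
} SharedHeader;

void ReaderLoop(SharedHeader *hdr, SIZE_T recordCount)
{
    LONG64 readIndex = 0;

    for (;;) {
        /* Atomic read of the writer's published index. */
        LONG64 writeIndex = InterlockedCompareExchange64(&hdr->WriteIndex, 0, 0);

        if (readIndex == writeIndex) {
            Sleep(1);             /* nothing new: wasteful but simple */
            continue;
        }

        while (readIndex < writeIndex) {
            SIZE_T slot = (SIZE_T)(readIndex % (LONG64)recordCount);
            /* Consume the record at 'slot' here. If the writer can lap the
               reader, data is silently lost -- exactly the "can the readers
               tolerate loss?" question above. */
            (void)slot;
            readIndex++;
        }
    }
}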

Again, if the readers can tolerate lost data, the problem is trivial. If not then it is not. If my problem is relevant for your problem, I’ll continue to tell you about my progress in the hope that it helps you. If not, then let me know