Understanding poor performance of memory mapped files in system cache

I have a process that is memory mapping a 160GB file on a system with 500GB of memory.

It then reads one byte of every page in the mapping to ‘warm up’ the view; after this it does a lot of random (read) access into the mapped memory.
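
Roughly, the warm-up pass is the obvious loop over pages; a minimal sketch of the idea (with illustrative names, not the actual code):

// Touch one byte of every 4KB page so the whole view gets faulted in once.
volatile unsigned char sink = 0;
unsigned char* base = (unsigned char*)mappedView;   // illustrative: pointer returned by MapViewOfFile
const size_t pageSize = 4096;                       // x64 page size
for (size_t offset = 0; offset < viewSize; offset += pageSize)   // viewSize = size of the mapped view
    sink += base[offset];                           // one read per page forces the page fault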

There is a huge performance difference in the warm-up phase between a cold start and a start with the file already in the system cache. While some difference might be expected, the problem is that warm-up is far slower when the file is already in the cache (e.g. because we have just unzipped it).

When running with the file in the cache we see one core pinned at 100% CPU, all of it accounted as system time, but there is no I/O activity.

This is running on a 2008 R2 VM with 2 cores and 500GB of RAM. The host is not overallocated in any way (and we’ve seen the same behaviour on a physical machine too).

Running kernrate during this time gives the following hot spots:

----- Zoomed module NTOSKRNL.EXE (Bucket size = 16 bytes, Rounding Down) --------
Percentage in the following table is based on the Total Hits for this Zoom Module

ProfileTime 34943 hits, 65536 events per hit --------
Module                         Hits    msec    %Total  Events/Sec
MmProbeAndLockPages           17453  227666      49 %     5024025
KeQueryPriorityThread         16994  227666      48 %     4891897
ExfAcquirePushLockExclusive     131  227666       0 %       37709

Profiling the same operation during a cold start gives a very different result:

----- Zoomed module NTOSKRNL.EXE (Bucket size = 16 bytes, Rounding Down) --------
Percentage in the following table is based on the Total Hits for this Zoom Module

ProfileTime 1630 hits, 65536 events per hit --------
Module                                        Hits    msec    %Total  Events/Sec
KeSynchronizeExecution                         242  190710      14 %       83161
KeSaveFloatingPointState                       194  190710      11 %       66666
KeWaitForMultipleObjects                       112  190710       6 %       38487
KdPollBreakIn                                  100  190710       6 %       34364
MmProbeAndLockPages                             78  190710       4 %       26804
ExpInterlockedPopEntrySList                     55  190710       3 %       18900
CcSetDirtyPinnedData                            50  190710       3 %       17182
KeReleaseInStackQueuedSpinLockFromDpcLevel      47  190710       2 %       16151
KeAcquireSpinLockAtDpcLevel                     39  190710       2 %       13402

In the cold state the effective bandwidth (i.e. how many MB/s of the mapping are touched) rises from 50MB/s to over 90MB/s.

In the warm state the bandwidth spikes high (>250MB/s) for a few seconds, then shows what looks like an exponential drop-off and quickly falls below 30MB/s.

Does anyone have an explanation of what might be going on, or any suggestions on where else to dig?

Thanks!

Rob

What flags do you pass to open the file, and what flags to map it? Show the calls for CreateFile, CreateFileMapping, MapViewOfFile.

Thanks – the file and mapping are opened read only:

mHandle = CreateFile(filename,
                     GENERIC_READ,
                     FILE_SHARE_READ | FILE_SHARE_DELETE,
                     NULL,
                     OPEN_EXISTING,
                     FILE_ATTRIBUTE_NORMAL | FILE_FLAG_RANDOM_ACCESS,
                     NULL);

// sz is the full file size
mMapping = CreateFileMapping(mHandle, NULL, PAGE_READONLY, sz.HighPart, sz.LowPart, 0);

// mMappingSize is the full size of the file too
mData = MapViewOfFileEx(mMapping, FILE_MAP_READ, 0, 0, (size_t) mMappingSize, NULL);

  1. Using a remote kernel debugger, break in at random moments and see whether it’s handling a page fault on the mapping or a page fault on code. If it’s a code or stack page fault, then it’s an unfortunate defect of Windows memory management that has made everybody’s life difficult since forever.

  2. Try opening the file with FILE_FLAG_NO_BUFFERING and see if it changes the “warm start” performance (FILE_FLAG_RANDOM_ACCESS is then ignored); a sketch of the flag change follows this list.
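
For concreteness, suggestion 2 applied to the CreateFile call shown above might look like this (only the flags line changes; a sketch, not tested):

mHandle = CreateFile(filename,
                     GENERIC_READ,
                     FILE_SHARE_READ | FILE_SHARE_DELETE,
                     NULL,
                     OPEN_EXISTING,
                     FILE_ATTRIBUTE_NORMAL | FILE_FLAG_NO_BUFFERING,  // replaces FILE_FLAG_RANDOM_ACCESS, which would be ignored anyway
                     NULL);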

>I have a process that is memory mapping a 160GB file on a system with 500GB of memory.

By using memory mapping, you’re at the mercy of the Mm’s and Cc’s heuristics, which were not developed by you and thus are not controllable by you.

They are controllable by the design team of a commodity OS that targets the average Joe’s tasks (those tasks are not so primitive - for instance, this OS can run sophisticated server software like MS SQL Server, and run it well - but they are still the common case, and what you do is uncommon).

A 160GB memory-mapped file is surely not an average Joe’s task.

So, for such a task, be prepared to implement the whole replacement policy on your own, without relying on the underlying OS.

This can mean, for instance, allocating nonpageable memory (maybe using MmAllocatePagesForMdl or such), and then using noncached reads of the file at a proper size to achieve maximum performance.

Probably you will need a user-mode cache of your own, and use it instead of the Cc/Mm cache.
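
As a rough user-mode illustration of that idea (names and sizes here are made up, purely a sketch): read the file in large, sector-aligned chunks with the system cache bypassed, and keep the chunks in an application-level cache.

#include <windows.h>

// Sketch: one noncached, aligned read of a 1MB chunk at 'chunkOffset'.
const DWORD chunkSize = 1 << 20;                     // a multiple of the sector size
ULONGLONG chunkOffset = 0;                           // must also be sector-aligned

// VirtualAlloc returns page-aligned memory, which satisfies the
// FILE_FLAG_NO_BUFFERING buffer-alignment requirement.
void* buf = VirtualAlloc(NULL, chunkSize, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

HANDLE h = CreateFile(filename, GENERIC_READ, FILE_SHARE_READ, NULL,   // filename as in the OP's CreateFile call
                      OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);

OVERLAPPED ov = {};                                  // used here only to carry the file offset
ov.Offset     = (DWORD)(chunkOffset & 0xFFFFFFFF);
ov.OffsetHigh = (DWORD)(chunkOffset >> 32);
DWORD bytesRead = 0;
ReadFile(h, buf, chunkSize, &bytesRead, &ov);        // offset and length stay sector-aligned
// ...store 'buf' in your own cache keyed by chunkOffset instead of relying on Cc/Mm...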

At least - don’t expect your “warm up” to have any positive value. It can have negative value instead.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

Using FILE_FLAG_NO_BUFFERING makes things more uniform … but slower, averaging less than 30MB/s.

I tried various other flag combinations, but nothing provided an improvement.

Can you elaborate a little on the “code or stack page fault” issue or the Windows memory management ‘defect’? I’m not sure I follow where this would come into play. I’ll need to dig into how to set up a remote kernel debugger on this VM.

I have replacement code in a newer version that handles the memory management differently and uses async ReadFileScatter calls to pull the data off disk. This happily pegs out at 400MB/s and is the way to go for the future.
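
For anyone curious, that approach is roughly along these lines (a simplified sketch with made-up sizes, not the actual code): open the file unbuffered and overlapped, then issue scatter reads into page-sized segments and keep several requests in flight.

#include <windows.h>

// Sketch: one async scatter read of 1MB into 256 page-sized buffers.
HANDLE h = CreateFile(filename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING,
                      FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);

const DWORD pageSize  = 4096;
const DWORD pageCount = 256;
BYTE* pages = (BYTE*)VirtualAlloc(NULL, pageCount * pageSize,
                                  MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

FILE_SEGMENT_ELEMENT seg[pageCount + 1] = {};        // final element must stay NULL
for (DWORD i = 0; i < pageCount; ++i)
    seg[i].Buffer = PtrToPtr64(pages + i * pageSize);

OVERLAPPED ov = {};
ov.Offset = 0;                                       // sector-aligned file offset
ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

if (!ReadFileScatter(h, seg, pageCount * pageSize, NULL, &ov) &&
    GetLastError() == ERROR_IO_PENDING)
    WaitForSingleObject(ov.hEvent, INFINITE);        // real code keeps several of these in flight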

However, I was hoping to find a simpler ‘fix’ for the older release. I realize this is not a normal use case – but it is on a server OS with plenty of free memory even for this size of file.

Thanks.

>Can you elaborate a little on the “code or stack page fault” issue or the Windows memory management ‘defect’?

In XP, when you had a lot of IO with large files (like copying a bunch of 700 MB files off a DVD-R), all RAM was preferentially given to hold these files in cache, at a cost of discarding code pages almost immediately. This caused severe code thrashing. The system would basically stand still.

And still, in Windows 7, with lots of memory, Windows will discard code and data pages even though the commit size is below physical memory size.

This makes me think Windows will readily discard pages that belong to sections that don’t have a handle open (like all the executables - they only have a section handle open, not a file handle), giving too much preference to file I/O.

In your case, the high cost might be because Windows may be using linear search for a candidate page to discard.

If the file is already cached and is being accessed again then nothing should be getting discarded.

I can’t tell from the kernrate output what’s really going on. A ~30 second xperf/WPA trace would be more useful. Something like:

xperf -on base -stackwalk profile
let the scenario run for 30 seconds
xperf -d base.etl
upload base.etl to skydrive, post a link here

It looks like a pathological case of some O(N^2) or O(N^3) algorithm within the memory manager, which becomes a bottleneck with very large sections.

If you have time, can you break a few times into the debugger during that slow priming, and post the stack traces? I wonder what’s there.

This is the unfortunate effect of the LRU approach to page replacement that Windows apparently relies upon. Some algorithms (like ARC, for example, which is probably the most widely known but definitely not the only one) have been designed to alleviate the issue. When the LRU approach gets combined with a pageable kernel, you get the system crawling like a turtle under memory pressure…

Anton Bassov

I’ve never seen a turtle under memory pressure, so I’m not sure what one
looks like, or how fast it goes.

One of the problems with big-O measurements is that O(N^2) really
translates as T = t[setup] + C * N^2 + t[teardown].

For most algorithms, “teardown” time is zero, or indistinguishable from
it. For some, however, setup time is nontrivial and can actually have an
impact. And, finally, there’s that nasty constant of proportionality, C.
I’ve seen linear O(N) algorithms that sucked because C was HUGE. I once
had to rewrite an O(N log2(N)) algorithm to reduce C to a manageable size,
because it dominated the total time (setup and teardown were effectively
zero, so only C mattered). Yes, it could be O(N^2) or O(N^3), but when,
say, a disk write and a disk read become part of C, even O(N) can be a
flaming disaster. (Disk write to send a modified page out to make a page
frame available; disk read to bring in the new page). Do the arithmetic:
how many pages do you have to move (and I think the file sizes would have
been in MB, and the mapping in MB, because you can’t map more than 4GB,
AFAIK even in Win64…I may be wrong).
joe

Odd … my last post seemed to disappear somewhere into the ether.

Anyway, after being dragged off this for a couple of days, I have some xperf traces:
https://s3.amazonaws.com/random-bitbucket/base.etl
https://s3.amazonaws.com/random-bitbucket/base2.etl

And I’ve also attached a kernel debugger.

The culprit appears to be:

Child-SP RetAddr Call Site
fffff88005f86a88 fffff8000172b804 nt!MiGetProtoPteAddressExtended
fffff88005f86a90 fffff800016e5069 nt!MiCheckUserVirtualAddress+0x10c
fffff88005f86ac0 fffff800016d6cae nt!MmAccessFault+0x249
fffff88005f86c20 0000000140083750 nt!KiPageFault+0x16e
000000000012bb40 0000000000000000 0x1`40083750

Stepping through, there is some loop:

fffff8000171ff49 33c0            xor     eax,eax
fffff8000171ff4b 488b5c2430      mov     rbx,qword ptr [rsp+30h]
fffff8000171ff50 4883c420        add     rsp,20h
fffff8000171ff54 5f              pop     rdi
fffff8000171ff55 c3              ret
fffff8000171ff56 498b5a50        mov     rbx,qword ptr [r10+50h]
fffff8000171ff5a 492bd9          sub     rbx,r9
fffff8000171ff5d 48c1fb03        sar     rbx,3
fffff8000171ff61 412b5a18        sub     ebx,dword ptr [r10+18h]
fffff8000171ff65 03da            add     ebx,edx
fffff8000171ff67 eb0b            jmp     nt!MiGetProtoPteAddressExtended+0x68 (fffff8000171ff74)
fffff8000171ff69 4d8b4010        mov     r8,qword ptr [r8+10h]
fffff8000171ff6d 2bd8            sub     ebx,eax
fffff8000171ff6f 4d85c0          test    r8,r8
fffff8000171ff72 74d5            je      nt!MiGetProtoPteAddressExtended+0x3d (fffff8000171ff49)
fffff8000171ff74 418b4018        mov     eax,dword ptr [r8+18h]
fffff8000171ff78 3bd8            cmp     ebx,eax
fffff8000171ff7a 73ed            jae     nt!MiGetProtoPteAddressExtended+0x5d (fffff800`0171ff69)

At the top of this loop I have @rbx = 0x2487460 and it doesn’t exit until @rbx = 158, dropping by 0x200 each time.

It seems to be following some linked list, and what it wants is always at the end of a growing list. The size in @rbx is ~ the number of pages that have been touched.

Not sure what the significance of the 0x200 is?

I also get stack traces that look like:

fffff88005f867e8 fffff800016b4963 nt!DbgBreakPointWithStatus
fffff88005f867f0 fffff800016e3f41 nt! ?? ::FNODOBFM::`string'+0x5d94
fffff88005f86820 fffff800016f5617 nt!KiSecondaryClockInterrupt+0x131 (TrapFrame @ fffff88005f86820)
fffff88005f869b0 fffff800016e5179 nt!MiDispatchFault+0x2e7
fffff88005f86ac0 fffff800016d6cae nt!MmAccessFault+0x359
fffff88005f86c20 0000000140083750 nt!KiPageFault+0x16e (TrapFrame @ fffff88005f86c20)
000000000012bb40 00000000`00000000 RapidResponse!DFS::Store::WarmUp+0x250

No idea what the FNODOBFM bit is?

Stack traces for when the process is ‘healthy’ do not show MiGetProtoPteAddressExtended on the stack.

Unfortunately, since the file is 150GB in size and contains offsets used for addressing, I need the whole 150GB mapped in one contiguous chunk of address space. As far as I can tell there is no way to guarantee this using smaller views.

0x200 is the number of PTEs per page table page (512 PTEs of 8 bytes each in a 4KB page).

As I suspected, it’s a bad case of O(N^2).
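
Rough back-of-the-envelope arithmetic (my numbers, estimated from the trace above) on why that O(N^2) walk hurts so badly here:

pages in a 160GB mapping:                 160GB / 4KB         ~= 42 million
list nodes walked per fault (late on):    42 million / 512    ~= 80,000
total nodes over the whole warm-up:       ~(42e6 * 80,000)/2  ~= 1.7e12

Even at a nanosecond or two per node, that is tens of minutes of one core doing nothing but walking the prototype-PTE chain, which fits the pinned core with zero I/O.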

Rob,

The experts here (as usual) have been spot on.

Take a look at this article (http://support.microsoft.com/kb/2549369). It says FILE_FLAG_RANDOM_ACCESS will disable read-ahead but will increase the Cc’s working set (a large FS cache). But if this flag is not provided, then the views (256 KB) are unmapped and moved to the standby list after the read.

Also, have a look at http://blogs.msdn.com/b/oldnewthing/archive/2012/01/20/10258690.aspx

Thanks,
Arvind

Do you mean 150MB? I do not believe it is possible to map 150GB into the
address space; even in Win64, I believe the mapping is limited. But if it
is truly 150GB, and you are seeing this behavior, it sounds like extreme
memory pressure is being exercised. If you have less than 150 GB of
physical memory, somebody has to be paged out. The list you describe
sounds like the LRU list; recent pages are added to the end (an O(1)
operation), and the candidate least-used page is removed from the head.
Here’s the problem: when a page is added to the list, it must first be
removed from the existing list. While a linear search is O(n/2),
pathological paging behavior can force it to O(n). In addition, a lot of
performance tradeoffs are based on “typical” behavior, where n is
typically “small”, for suitable definition of “small”. When you push n up
to the number of pages required to handle 150GB, you have probably
stressed the algorithm far outside its design limits.

I have encountered problems like this many times in my career, and
although the sizes were much smaller, so were the machines.

Now, there are many ways to approach this problem. But the one most under
your control is the file mapping. While there are lots of advantages to
mapping the entire file as contiguous bytes, the performance problems you
are seeing might be mitigated by throwing this assumption out. I have
often commented that optimizing lines of code at the line level, barring
complex inner loops of DSP processing, generally buys you single-digit
percentage improvements; architectural and high-level algorithmic changes
will buy you orders of magnitude performance improvement.

For example, by transforming a matrix multiply of two large matrices to
access the data in a cache-aware fashion, you get a significant
performance improvement; it is not unusual to see factors of 10 to 20.
So, if the fix involves massive work on the part of Microsoft to support
one customer who needs 150GB of mapping, it will probably go into the
Someday, Maybe pile. Don’t expect an improvement; you are many sigmas
from the mean value. So what you are left with is changing the algorithms
you use. If the cost of making the change gives you a high payoff, the
effort may be justified.
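
(For anyone who has not seen that transformation: the classic example is tiling the multiply so each block is reused while it is still cache-resident. The sketch below is purely illustrative and has nothing to do with the poster's file-mapping code.)

#include <vector>

// Illustrative only: tiled (cache-blocked) N x N matrix multiply.
const int N = 2048, B = 64;                // B chosen so three BxB tiles fit in cache
std::vector<double> A(N * N), Bm(N * N), C(N * N);

// Naive order (i, j, k) strides through Bm with stride N for every element of C.
// The tiled version below reuses each BxB tile many times before evicting it:
for (int ii = 0; ii < N; ii += B)
  for (int kk = 0; kk < N; kk += B)
    for (int jj = 0; jj < N; jj += B)
      for (int i = ii; i < ii + B; ++i)
        for (int k = kk; k < kk + B; ++k)
          for (int j = jj; j < jj + B; ++j)
            C[i * N + j] += A[i * N + k] * Bm[k * N + j];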

The other thing I used to tell my students was “ignore code size. Code
size is irrelevant in modern computers. Instead, worry about the data.
Data will kill you.”

For problems of the massive size you are dealing with, you have to realize
that there may be NO solution that is possible given existing machine
architectures. I was part of an OS performance team in the late 1970s,
and while others did the measurements, I was part of the “evaluation team”
that had to figure out what to do to change the performance. Like
Windows, we were a general-purpose OS whose users pushed the envelope. We
couldn’t solve the four-sigmas-out problems, and you’re far beyond four
sigmas. So you may have to reassess the decision about a single
contiguous mapping, no matter how unpleasant it sounds. I’ve written tens
of thousands of lines of code in the past 50 years solely to get
acceptable performance on machines that were too small. You have just
reinvented that problem, just with scaled-up numbers. As the technology
got bigger and faster, problems have kept pace and have gotten larger.
Your problem just got larger than the technology can support.

The other day, I bought an 8GB SD card for a development machine. The
machine is a 16MHz machine with 256K memory, in a form factor smaller than
one card from my first mainframe. When I started in this profession, 50
years ago, I doubt there was 8GB if you summed up the memory of all the
computers in the world. Every day, we faced the problem you face now.
And we didn’t have easy answers, either. “Big” files were 100K. “Huge”
files might be 2MB (the limit of physical disk drives). And we might have
2K of buffer space. Modulo scaling, you’ve got the same problem.

Re-examine your design assumptions. It is clear the current ones don’t work.
joe

“But if it is truly 150GB, and you are seeing this behavior, it sounds like extreme
memory pressure is being exercised. If you have less than 150 GB of
physical memory, somebody has to be paged out.”

The OP stated the system under test has 500 GB of RAM.

> The other thing I used to tell my students was “ignore code size. Code size is irrelevant in modern computers. Instead, worry about the data. Data will kill you.”

Well, apparently not so many participants of this NG (particularly the ones employed by MSFT) attended your classes, right? Otherwise, they would have immediately recognized the profound idiocy of the very concept of making kernel code pageable. In fact, the concept of pageable kernel data and stack is not much better either, but this is the OS that has been designed by those who are known to think of UNIX as a lifelong foe. As a result, the UNIX concept of daemons seems to be totally foreign for them - instead of delegating relatively cumbersome tasks to the userland (where anything that has not been explicitly locked may be paged out under the memory pressure), they perform them right in the kernel. As a result of this design, the kernel is unnecessarily bloated and consumes so much memory that one may want to swap quite large parts of it out to the disk…

Anton Bassov

> to think of UNIX as of a lifelong foe. As a result, the UNIX concept of daemons seems to be totally foreign for them

Yes, in Windows, daemons are called “services”.

> instead of delegating relatively cumbersome tasks to the userland (where anything that has not been explicitly locked may be paged out under the memory pressure)

This is what Windows does, with PnP, for instance.

Pageable kernel is good for VM guests.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

@Maxim:

> Pageable kernel is good for VM guests.

I wish you hadn’t bought into MS numerology BS. The whole sum of paged kernel code is under 20 or most likely even 10 MB. If such a small difference makes for a measurable performance hit over the noise level, the system is severely underpowered. Of course, with the crappy Windows MM, even having lots of spare memory doesn’t help much anyway.

> Yes, in Windows, daemons are called “services”.

In a sense that Windows services run without user interaction, indeed, you can compare them to UNIX daemons. However, in terms of interactions with the kernel they are miles away from daemons. One of the cornerstones of UNIX philosophy is clear separation of policy and mechanism, so that all policy is meant to be implemented in daemons and kernel is meant to provide only a mechanism. However, Windows kernel is full of policy decisions, and presence of Windows services does not alleviate it in any possible way…

> Pageable kernel is good for VM guests.

Stop listening to the MSFT marketing department and, instead, attend a couple of “Dr. Joe’s” classes…

There is no VM that will let you run a modern Windows workstation guest with less than 512MB of memory, and the recommended minimum is 1GB. In general, a typical modern VM workstation guest will have more memory than most physical machines did 10 years ago. As Alex pointed out, the amount of additional RAM you can get by making the kernel pageable is unlikely to exceed a few MB, which is simply negligible on such a system…

Anton Bassov