Understanding poor performance of memory mapped files in system cache

Ahem.

The Windows OS comprises 3 layers: The HAL, the Kernel, and the Executive.

In general, and as a matter of architecture, OS-level policy code is restricted to the Executive; policy is not allowed in the Kernel.

Policy affecting user preferences and activities (that is, not OS policy) in Windows – again, generally and as a matter of architecture – is typically restricted to user-mode. This is why you see experienced people here rebel at the idea of “setting system power policy from the driver” and such things. This, clearly, according to Windows architecture, belongs in user-mode.

The lines are constantly subject to debate, and practice doesn’t always precisely match architecture. The biggest example of this is OS scheduling policy, much of which is in fact in the Kernel (it is, of course, also impacted by the Executive in major ways).

Peter
OSR

xxxxx@broadcom.com wrote:

>I wish you hadn’t bought into MS numerology BS. The whole sum of paged kernel code is under 20 or most likely even 10 MB. If such a small difference makes for a measurable performance hit over the noise level, the system is severely underpowered.

That’s PER GUEST. If you’re running a server with 100 VMs, that makes
2GB of physical memory that can now be used for something else.

I think you’re focusing on the types of clients you’re used to seeing,
and ignoring the megaclients that Microsoft has to satisfy.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Tim Roberts wrote:

Alex Grig wrote:

> I wish you hadn’t bought into MS numerology BS.
> The whole sum of paged kernel code is under 20 or
> most likely even 10 MB. If such a small difference makes
> for a measurable performance hit over the noise level, the
> system is severely underpowered.

That’s PER GUEST. If you’re running a server with 100 VMs,
that makes 2GB of physical memory that can now be used for
something else.

101 VMs instead of 100?

How much total physical memory would such a system have… to support 100 VMs? 128GB? 256GB? 512GB? If any of those is correct, then 2GB is less than 2% of the total amount of memory, and therefore unimportant.

Further… The last time I looked, and it was a long while ago I admit, the conditions under which Windows OS CODE itself is paged (we’re not talking about DRIVERS here) are limited to the point where no realistic Windows system actually paged kernel code. Given this limit and the use of large pages, I don’t THINK you’ll actually see NTOS code paged in real circumstances.

Real observations, with configuration details, to the contrary welcome.

Perhaps Mr. Oshins will become involved at some point, now that we’ve mentioned VMs prominently.

Peter
OSR

>That’s PER GUEST. If you’re running a server with 100 VMs, that makes 2GB of physical memory that can now be used for something else.

I doubt anybody ever uses such fine granularity to specify memory allotment. “Would you like 1024 MB or maybe 1008? If you use 1008, we may be able to run another little VM in addition to your other 64”. But even if you give it 16 MB less, the performance difference will still be below noise level.

If your VM host uses memory over-subscription, it needs to also use memory deduplication. It’s pretty easy to de-duplicate code pages.

Dude, I’ll respond if it’ll make you happy. But I find that arguing with people who made their minds up years ago to be relatively unrewarding.

  • Jake Oshins

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@osr.com
Sent: Wednesday, November 20, 2013 9:57 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Understanding poor performance of memory mapped files in system cache

How much total physical memory would such a system have… to support 100 VMs? 128GB? 256GB? 512GB? If any of those is correct, then 2GB is less than 2% of the total amount of memory, and therefore unimportant.

Further… The last time I looked, and it was a long while ago I admit, the conditions under which Windows OS CODE itself is paged (we’re not talking about DRIVERS here) are limited to the point where no realistic Windows system actually paged kernel code. Given this limit and the use of large pages, I don’t THINK you’ll actually see NTOS code paged in real circumstances.

Real observations, with configuration details, to the contrary welcome.

Perhaps Mr. Oshins will become involved at some point, now that we’ve mentioned VMs prominently.

Peter
OSR


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

>But I find that arguing with people who made their minds up years ago to be relatively unrewarding.

This works both ways. The ivory tower must be a boring place, though.

Arvind Pandurang Dalvi

The experts here (as usual) have been spot on.

Take a look at this article (http://support.microsoft.com/kb/2549369). It says
FILE_FLAG_RANDOM_ACCESS will disable read-ahead but will increase the CC’s
working set (large FS cache). But if this flag is not provided, then the views
(256 KB) are unmapped and moved to the standby list after the read.

Thanks – I’ve tried various combinations of flags, and I’m seeing this behaviour with and without FILE_FLAG_SEQUENTIAL_SCAN explicitly specified.

Joseph M. Newcomer

Do you mean 150MB? I do not believe it is possible to map 150GB into the
address space, even in Win64.

No, 150GB … fortunately it is possible (and we’ve done even larger). Coming to the conclusion it might not be wise though…

> The Windows OS comprises 3 layers: The HAL, the Kernel, and the Executive.

I don’t know whether, for practical purposes, such a distinction really makes sense for us, taking into consideration that

A. The Kernel and the Executive are implemented by the same module, i.e., ntoskrnl.exe

B. Despite being implemented as a separate module, the HAL is integrated with ntoskrnl.exe so tightly that
ntoskrnl.exe and hal.dll both export and import to/from one another

I guess such a distinction makes sense only for the guys who actually maintain the ntoskrnl.exe and hal.dll code.

Anton Bassov

>No, 150GB … fortunately it is possible (and we’ve done even larger). Coming to the conclusion it might not be wise though…

I suggest you forgo the file mapping, go with plain VirtualAlloc, and just read the file into it. If you issue large reads, it will perform as well as or better than any MMF with read-ahead could (even SATA drives these days do 150+ MB/s sustained linear speed).

If you don’t need to share or reuse the data in memory, just use VirtualAlloc. Use large pages; this will make the buffer resident (non-pageable).

If you need to share the data between processes, create a named no-file section with large pages, and read the file into it.

You’ll only really want the file mapped directly if you want to cache it for separate runs of the process. I don’t think the cache manager will cache the entire file, anyway.

Thank you for taking the time to collect the traces.

Here’s what’s happening. I believe the program you’re using to unzip the file does not pre-extend it to its final size before extracting the data. Instead it issues lots of extending cached writes. This creates many fragments in the memory manager’s structures representing the mapped file, resulting in a linear walk on every page fault and overall N^2 time. This fragmented state is not persisted across reboots, so if you reboot the system and access the same file again you should no longer be able to reproduce the problem.

We will fix the N^2 behavior in a future release. For now, unfortunately, the only workaround I can think of is to use a different program to extract the file. Hopefully at least some of these programs issue SetEndOfFile before decompressing, since it’s a generally useful thing to do to reduce on-disk fragmentation.

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@kinaxis.com
Sent: Monday, November 18, 2013 5:48 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Understanding poor performance of memory mapped files in system cache

Odd … my last post seemed to disappear somewhere into the ether.

Anyway, after being dragged off this for a couple of days, I have some xperf traces:
https://s3.amazonaws.com/random-bitbucket/base.etl
https://s3.amazonaws.com/random-bitbucket/base2.etl

And I’ve also attached a kernel debugger.

The culprit appears to be:

Child-SP RetAddr Call Site
fffff88005f86a88 fffff8000172b804 nt!MiGetProtoPteAddressExtended
fffff88005f86a90 fffff800016e5069 nt!MiCheckUserVirtualAddress+0x10c
fffff88005f86ac0 fffff800016d6cae nt!MmAccessFault+0x249
fffff88005f86c20 0000000140083750 nt!KiPageFault+0x16e
000000000012bb40 0000000000000000 0x1`40083750

+1 for best post today

“Jake Oshins” wrote in message news:xxxxx@ntdev…

Dude, I’ll respond if it’ll make you happy. But I find that arguing with
people who made their minds up years ago to be relatively unrewarding.

  • Jake Oshins

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@osr.com
Sent: Wednesday, November 20, 2013 9:57 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Understanding poor performance of memory mapped files in
system cache

How much total physical memory would such a system have… to support 100
VMs? 128GB? 256GB? 512GB? If any of those is correct, then 2GB is less
than 2% of the total amount of memory, and therefore unimportant.

Further… The last time I looked, and it was a long while ago I admit, the
conditions under which Windows OS CODE itself is paged (we’re not talking
about DRIVERS here) are limited to the point where no realistic Windows
system actually paged kernel code. Given this limit and the use of large
pages, I don’t THINK you’ll actually see NTOS code paged in real
circumstances.

Real observations, with configuration details, to the contrary welcome.

Perhaps Mr. Oshins will become involved at some point, now that we’ve
mentioned VMs prominently.

Peter
OSR



On my system I have 70 MB of driver code+data (ignoring the kernel and the HAL, which in this case are mapped with large pages).

16.5 MB of that is pageable. 12 MB of that is still unreferenced 20 minutes after trimming all system memory.

The top-selling Windows device in the world (Nokia Lumia 520) has 512 MB of memory.

Would the user notice if 12 MB of memory disappeared from his phone? Quite possibly yes (e.g. some large game might refuse to run). But even if not, if you allow small regressions like that in a project the size of Windows, it’s a sure way to end up with a release that has double or triple the memory requirements compared to the previous version.

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@broadcom.com
Sent: Wednesday, November 20, 2013 6:38 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Understanding poor performance of memory mapped files in system cache

@Maxim:

>Pageable kernel is good for VM guests.

I wish you hadn’t bought into MS numerology BS. The whole sum of paged kernel code is under 20 or most likely even 10 MB. If such a small difference makes for a measurable performance hit over the noise level, the system is severely underpowered. Of course, with crappy Windows MM, even having lots of spare memory doesn’t help much anyway.

>I have 70 MB of driver code+data

The talk was mostly (if not only) about paged code.

>(e.g. some large game might refuse to run).

Well, the pagefile is still there. Why would a game care about available working set size (which it cannot detect, anyway)?

> 101 VMs instead of 100?

You are way too optimistic…

I think that Tim’s logic is just faulty. The problem is that memory is normally reserved by the VMs on a per-guest basis. If you allocate, say, 2G for guest A, it is going to be A’s memory that guest B is unable to use, no matter how little RAM guest A currently uses. Certainly, a paravirtualized guest can implement a balloon driver that releases memory to the hypervisor upon request, but if you think about it a bit you will realize how ridiculous releasing memory would be from the guest’s perspective - a guest that pages out kernel code in order to free some RAM is not going to do so only in order to be able to release it to the hypervisor, right…

In other words, the situation is exactly the same as you would face with physical machines - you cannot combine per-machine memory savings in order to evaluate memory savings on a cluster basis. Doing so would be pretty similar to calculating an average temperature and/or blood pressure among the hospital patients…

Anton Bassov

You’re right, it won’t happen the way you described. It will be more like this:

The guest will notice that some pageable kernel/driver code is not getting accessed. After a while, unless the guest already has sufficient amounts of unused memory, it will trim those pages, and eventually write them to the pagefile. The pages will then be put on the standby list and become available for allocation. If the host then decides that it needs more memory for other VMs and starts ballooning memory out of the guest, it will be able to obtain a few megabytes more than if the code was non-pageable.

The situation is different for VMs vs. physical machines because a typical physical machine probably spends more than 80% of the time using less than 20% of its resources. With things like memory ballooning, VMs can be managed much more tightly, so the same absolute reduction in footprint has a greater relative impact.

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@hotmail.com
Sent: Thursday, November 21, 2013 8:05 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Understanding poor performance of memory mapped files in system cache

101 VMs instead of 100?

You are way too optimistic…

I think that Tim’s logic is just faulty. The problem is that memory is normally reserved by the VMs on a per-guest basis. If you allocate, say, 2G for guest A, it is going to be A’s memory that guest B is unable to use, no matter how little RAM guest A currently uses. Certainly, a paravirtualized guest can implement a balloon driver that releases memory to the hypervisor upon request, but if you think about it a bit you will realize how ridiculous releasing memory would be from the guest’s perspective - a guest that pages out kernel code in order to free some RAM is not going to do so only in order to be able to release it to the hypervisor, right…

In other words, the situation is exactly the same as you would face with physical machines - you cannot combine per-machine memory savings in order to evaluate memory savings on a cluster basis. Doing so would be pretty similar to calculating an average temperature and/or blood pressure among the hospital patients…

Anton Bassov

>If the host then decides that it needs more memory for other VMs and starts ballooning memory out of the guest, it will be able to obtain a few megabytes more than if the code was non-pageable.

Still, if such a small variation is expected to produce noticeable gains, then your system is too close to the cliff, where its performance may hit a bottleneck and then fall precipitously because of thrashing.

>However, Windows kernel is full of policy decisions

Same with Linux - MM, scheduling quanta, and so on.

Daemons in UNIX were NOT created to separate policy and mechanism (and UNIXen violate this good design rule as often as Windows).

Daemons in UNIX were created EXACTLY for the same purpose as Windows services - to run independently of which user(s) are logged on.

Let’s stop counting misdesigns and kludges in these OSes. All of them have plenty. Nevertheless, all of them can do the work and are trustworthy (if properly handled).


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

>I don’t think the cache manager will cache the entire file, anyway.

The elementary way of making the 2008 (R2 or not) Cc/Mm go thrashing:

  • create many multi-GB files in the directory (TBs of data)
  • open each of them (there can be 100s of them) and read several bytes from the header
  • this must be done from the SMB client

Voila the thrashing.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

>some large game might refuse to run). But even if not, if you allow small regressions like that in a project the size of Windows

Absolutely and really so.

Also note that pageable kernel code does not introduce any major nuisance, and does not introduce ANY nuisance if you use DV (Driver Verifier) constantly.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com