How is Committed Pages > Commit Limit?!?!

I have about 30 Windows 2008 and 2003 server systems that have gotten into a bad state where nobody can log in to the system, either locally or over RDP. Services that are already running mostly work, but new processes cannot be started. The problem is also highly correlated with an uptime of 50-60 days. Based on past experience, I expect some sort of memory leak to be the root cause. And also based on past experience, it’s likely that one of our drivers or services is the culprit (it’s an “embedded” system with custom hardware but a standard PC motherboard.) [I originally posted this issue on the WINDBG list when I thought the issue was something else, but I now think that it’s something in the driver or internals arena.]

I was able to get a memory dump of a few systems, and the WinDbg !vm command certainly indicates something of the sort based on the errors it reports. However, all of the various Usages are far less than the Max/Limits, so there’s no obvious leak.

*** Virtual Memory Usage ***
Physical Memory: 521669 ( 2086676 Kb)
Page File: \??\C:\pagefile.sys
Current: 2393876 Kb Free Space: 2357824 Kb
Minimum: 2393876 Kb Maximum: 6260028 Kb
Available Pages: 88804 ( 355216 Kb)
ResAvail Pages: 988943 ( 3955772 Kb)
Locked IO Pages: 0 ( 0 Kb)
Free System PTEs: 386694 ( 1546776 Kb)
******* 681703 system cache map requests have failed ******
Modified Pages: 736 ( 2944 Kb)
Modified PF Pages: 736 ( 2944 Kb)
NonPagedPool Usage: 15092 ( 60368 Kb)
NonPagedPool Max: 386063 ( 1544252 Kb)
PagedPool 0 Usage: 6444 ( 25776 Kb)
PagedPool 1 Usage: 6365 ( 25460 Kb)
PagedPool 2 Usage: 1002 ( 4008 Kb)
PagedPool 3 Usage: 907 ( 3628 Kb)
PagedPool 4 Usage: 650 ( 2600 Kb)
PagedPool Usage: 15368 ( 61472 Kb)
PagedPool Maximum: 523264 ( 2093056 Kb)
********** 825082 pool allocations have failed **********
Session Commit: 2486 ( 9944 Kb)
Shared Commit: 8514 ( 34056 Kb)
Special Pool: 0 ( 0 Kb)
Shared Process: 5850 ( 23400 Kb)
PagedPool Commit: 15382 ( 61528 Kb)
Driver Commit: 5060 ( 20240 Kb)
Committed pages: 4294962071 (17179848284 Kb)
Commit limit: 1108180 ( 4432720 Kb)
********** Number of committed pages is near limit ********
********** 10528464 commit requests have failed **********
Total Private: 456244 ( 1824976 Kb)

But look at the Committed Pages! It is nearly 4000 times larger than Commit Limit! I’ve been doing a lot of reading and it doesn’t seem like it’s possible to get into the state that I see where the Committed Pages is at the maximum 16 TB (4294962071 == 0xFFFFEB97) but my system has a Commit Limit of a reasonable 4 GB. How are my systems committing more than the commit limit?!?!

Committed pages: 4294962071 (17179848284 Kb)
Commit limit: 1108180 ( 4432720 Kb)
********** Number of committed pages is near limit ********
********** 10528464 commit requests have failed **********
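
For anyone who wants to check the numbers, here is a quick sanity check of the arithmetic as a Python sketch. It assumes 4 KB pages, which is consistent with the Kb figures that !vm prints next to each page count; the variable names are just for illustration.

```python
PAGE_KB = 4  # assumption: 4 KB pages, matching the page/Kb pairs in the !vm output

committed_pages = 4294962071     # "Committed pages" from !vm
commit_limit_pages = 1108180     # "Commit limit" from !vm

committed_kb = committed_pages * PAGE_KB
limit_kb = commit_limit_pages * PAGE_KB

# How many times larger than the limit is the committed count?
ratio = committed_pages / commit_limit_pages

# Committed size expressed in TB (binary units: 1024**3 KB per TB)
committed_tb = committed_kb / 1024**3

print("committed pages as hex:", hex(committed_pages))
print("committed Kb:", committed_kb, " limit Kb:", limit_kb)
print("ratio: %.1f   committed: %.2f TB" % (ratio, committed_tb))
```

The Kb values come out identical to what !vm shows, so the debugger is at least formatting the raw page counts consistently.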

Also, this was an interesting tidbit in that my systems show 3 of the 4 types of commit request failures.

0: kd> dd nt!MiChargeCommitmentFailures
81d51f80 0093b93f 00000000 000c7189 00007c08

MiChargeCommitmentFailures[0] - If the system failed a commit request and an expansion of the pagefile has failed.
MiChargeCommitmentFailures[1] - If the system failed a commit and we have already reached the maximum pagefile size.
MiChargeCommitmentFailures[2] - If the system failed a commit while the pagefile lock is held.
MiChargeCommitmentFailures[3] - If the system failed a commit and the NewCommitValue is less than or equal to CurrentCommitValue.
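
Decoding that dd line, as a small Python sketch (the variable names are mine; the meaning of each slot is simply the list above). The four counters add up to the 10528464 failed commit requests that !vm reports, and slot [1] is the only one that is zero:

```python
# Raw output line from: dd nt!MiChargeCommitmentFailures
dd_line = "81d51f80 0093b93f 00000000 000c7189 00007c08"

# First token is the address, the rest are the four 32-bit counters
address, *dwords = dd_line.split()
failures = [int(d, 16) for d in dwords]

for index, count in enumerate(failures):
    print("MiChargeCommitmentFailures[%d] = %d" % (index, count))

total = sum(failures)
print("total failed commit requests:", total)
```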

Also odd is the number of pool allocations that failed when the pool usages are so much less than the pool maximums, but I suspect that’s just a failure to grow the pool larger than the current size (far less than the maximum).

I’ve gone through each process, and they all have reasonable memory and virtual memory usage. No handle leaks, no pool leaks, and so on.

I cannot figure out what is wrong other than the 16 TB Committed Pages.

Any advice, troubleshooting ideas, or anything will be appreciated!

Mr. Wynnell… You always manage to ask the most interesting questions.

Having said that, I’ve got nothing that’ll help with your question :)

Happy New Year,

Peter
OSR
@OSRDrivers

Can you connect PerfMon to them and pull the memory stats? Some versions of WinDbg have broken !vm command. With PerfMon you may be able to see which process had overcommitted memory.

> Can you connect PerfMon to them and pull the memory stats?

Good thinking. Yes, PerfMon is able to connect to them when in this state. I’ve looked through many PerfMon counters and found nothing surprising or problematic. Unfortunately, I don’t have any systems in that state right now, so I can’t double-check my work to see if I missed something. But I’ll do that next time.

Event Viewer can also connect, and there’s nothing surprising except that suddenly at one point, the system throws a popup that it cannot increase the paging file, and many services start complaining that they cannot get memory, and so on. I have Process Tracking enabled, and there is no process started anywhere near that time.

> Some versions of WinDbg have broken !vm command.

I’ve looked at the dump with multiple recent versions (6.12, 6.2, 6.3, 10.0) and they all give the same !vm results.

Additionally, that value comes right out of memory, so there’s not much opportunity to get it wrong.

0: kd> dd nt!MmTotalCommittedPages L1
81d51f98 ffffeb97

There’s no reason to suspect memory corruption, since the variables before (nt!MmTotalCommitLimit) and after (nt!MmTotalCommitLimitMaximum) it are completely normal-looking and what I’d expect.

CommittedPages look corrupted.

I suspect there’s been a pattern of extra subtraction from MmTotalCommittedPages during freeing of at least 16 MB, and at some point it wrapped around to the big unsigned number. Do your drivers map paged pool to usermode memory? Somehow uncommit was counted twice, many times over.
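
To illustrate the kind of wraparound I mean, here is a toy Python sketch. It is not the kernel's actual accounting: it assumes the counter is a 32-bit unsigned value (consistent with the dd output on this system), the helper names are hypothetical, and the page count is picked only so that the result matches the value in the dump. The real sequence of events is unknown.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit unsigned counter

def charge(counter, pages):
    """Add pages to the commit counter, with 32-bit wraparound."""
    return (counter + pages) & MASK32

def uncharge(counter, pages):
    """Return pages from the commit counter, with 32-bit wraparound."""
    return (counter - pages) & MASK32

counter = 0                         # toy starting point, not a real system value
counter = charge(counter, 5225)     # commit is charged once
counter = uncharge(counter, 5225)   # correct return of the commit
counter = uncharge(counter, 5225)   # hypothetical bug: the same commit returned again

# Reinterpret the 32-bit value as signed to see how far below zero it went
as_signed = counter - (1 << 32) if counter & 0x80000000 else counter

print(hex(counter), counter, as_signed)
```

Read as a signed number, the value in the dump is -5225 pages, which is about 20 MB at 4 KB per page. That is what you would expect from a counter that went slightly below zero, not from 16 TB of real commit.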

Interestingly, I just manually “reproduced” it by being on a good system and just changing the variable in the debugger:

ed nt!MmTotalCommittedPages fff70000

I was watching PerfMon and so on, and the symptoms seemed the same, although mind you that I’ve never been logged in during the issue before. I got popups, very little “worked”, but I did see the huge Committed Bytes value in PerfMon, showing that PerfMon gets the value from there as well.

I then edited it back to the original value and the system returned to normal.

So, at least that supports the notion that this variable is the only apparent issue.

I may have one of my drivers periodically monitor the value, log it, and crash the system if it gets huge, so that I get a dump immediately after it occurs. Until I know how to make it occur on internal systems, I have to do all of my debugging in the field at our customers, so a crash dump works about as well as I can hope.

nt!MmTotalCommittedPages is not the only problem. You have more Resident Available Pages than Physical Memory.

988943 == F170F (Just one bit and it will look OK)
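
Spelled out as a Python sketch: this assumes the "one bit" is the highest set bit, 0x80000. Which bit it really is, and whether a flipped bit is the actual cause at all, is only a guess.

```python
res_avail_pages = 988943   # "ResAvail Pages" from !vm
physical_pages = 521669    # "Physical Memory" from !vm

suspect_bit = 0x80000      # assumption: the highest set bit of 0xF170F

print(hex(res_avail_pages))                     # the value as hex
print(res_avail_pages > physical_pages)         # more resident-available than physical?

# Clear the suspect bit and compare again
cleared = res_avail_pages & ~suspect_bit
print(hex(cleared), cleared, cleared < physical_pages)
```

With that bit cleared the value is 0x7170F (464655 pages), which is below the physical page count and would look plausible.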

*** Virtual Memory Usage ***
Physical Memory: 521669 ( 2086676 Kb)
Page File: \??\C:\pagefile.sys
Current: 2393876 Kb Free Space: 2357824 Kb
Minimum: 2393876 Kb Maximum: 6260028 Kb
Available Pages: 88804 ( 355216 Kb)
ResAvail Pages: 988943 ( 3955772 Kb)