How to track down excessive VM page faults?

This is a question about how to catch a specific page fault on Win2k,
although it happens on other flavours (notably 98) as well.

I’m developing an application which uses OpenGL for visualization.
Sometimes, on some systems, this application will start hitting 100 page
faults per second and slow to a crawl. “Ah,” you will think, “a memory
leak.”

Well, my pool size isn’t going up (much) to account for this. Even more
curiously, if I quit my application and restart it, it will still be slow,
or will very quickly become slow again. Rebooting the machine “solves” the
problem for 15-30 minutes. This points to something at the kernel level,
something that either never receives a process termination notification or
happily ignores it.

Nothing else is running on the machine, and these page faults appear to be
the only ones happening (according to the process monitor, anyway).
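One way to pin down when the slowdown starts is to sample a process’s cumulative page-fault count at intervals and turn it into a rate. A minimal sketch, assuming you already have (timestamp, cumulative count) pairs; on Windows such a count could come from, for example, the PageFaultCount field filled in by GetProcessMemoryInfo, or from Performance Monitor’s Process “Page Faults/sec” counter. The function names and the 100 faults/s default threshold (taken from the symptom above) are illustrative assumptions.

```python
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, cumulative page-fault count)


def fault_rates(samples: Sequence[Sample]) -> List[float]:
    """Faults per second for each interval between consecutive samples."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        rates.append((c1 - c0) / (t1 - t0))
    return rates


def first_slow_interval(samples: Sequence[Sample],
                        threshold: float = 100.0) -> Optional[int]:
    """Index of the first interval whose rate reaches `threshold`, else None."""
    for i, rate in enumerate(fault_rates(samples)):
        if rate >= threshold:
            return i
    return None


# Hypothetical readings: quiet at first, then roughly 115 faults/s.
samples = [(0.0, 1000), (1.0, 1010), (2.0, 1125), (3.0, 1240)]
print(fault_rates(samples))          # [10.0, 115.0, 115.0]
print(first_slow_interval(samples))  # 1
```

Knowing the interval at which the rate crosses the threshold tells you when it is worth attaching the debugger and arming the breakpoint.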

My guess, for various reasons, is the texture management in the OpenGL
driver we’re using. However, that may just be slander, and even if it isn’t,
it does me no good unless I can prove specifically how this is the case.
My current plan is to remote debug a system, put a breakpoint in the
kernel’s page fault handler once the problem has started exhibiting itself,
and then try to get a stack trace from there. Repeat 10 times, and hopefully
the culprit will be statistically clear.
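The “repeat 10 times” step amounts to tallying which module shows up most often just above the kernel’s own frames. A minimal sketch, assuming each sampled stack has been saved as text lines in WinDbg’s module!symbol+offset form, innermost frame first, and captured in the faulting thread’s context (for instance after switching to the trap frame with WinDbg’s .trap command) so that the caller’s frames are visible. The function names, the default skip set and the module name oglicd in the example are hypothetical.

```python
from collections import Counter
from typing import AbstractSet, Iterable, List, Optional


def frame_module(frame: str) -> Optional[str]:
    """Module name from a frame line such as 'nt!KiTrap0E+0x1a' or
    '0012f8c4 77e8d3b2 mydrv!Foo+0x2c'; None if no 'module!symbol' token."""
    for token in frame.split():
        if "!" in token:
            return token.split("!", 1)[0].lower()
    return None


def tally_culprits(stacks: Iterable[List[str]],
                   skip: AbstractSet[str] = frozenset({"nt"})) -> Counter:
    """For each sampled stack (innermost frame first), count the first frame
    whose module is not in `skip`, i.e. the nearest frame outside those modules."""
    counts: Counter = Counter()
    for stack in stacks:
        for frame in stack:
            module = frame_module(frame)
            if module is not None and module not in skip:
                counts[module] += 1
                break
    return counts


# Hypothetical samples taken at the page fault breakpoint.
stacks = [
    ["nt!KiTrap0E+0x1a", "oglicd!TexUpload+0x40", "app!Draw+0x10"],
    ["nt!KiTrap0E", "oglicd!TexUpload+0x44", "app!Draw+0x10"],
    ["nt!KiTrap0E", "nt!MmAccessFault+0x5", "app!Main+0x3"],
]
print(tally_culprits(stacks).most_common(1))  # [('oglicd', 2)]
```

Other system modules (the HAL, for instance) can be added to `skip` if they tend to sit between the kernel frames and the real caller.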

I’ve looked through the WinDbg documentation and Win2k DDK, as well as
searched MSDN, googled around the web and the archive of this list, but I
can’t find any good pointers on how to do this (or what else could be the
culprit). Any suggestions, pointers, help or insight you may be able to
share would be much appreciated!

“WinDbg” wrote in message
news:xxxxx@windbg…

[snip]

> My guess is, for various reasons, the texture management in the OpenGL
> driver we’re using. However, that may just be slander, and even if it isn’t,
> it serves me no good unless I can actually prove specifically how this is
> the case. My current idea for how to do this is to remote debug a system and
> put a breakpoint in the page fault handler of the kernel when the problem
> has started exhibiting itself, and then try to get a stack trace from there.
> Repeat 10 times and hopefully the culprit will be statistically clear.

[snip]

I don’t know the names of the functions, so you’ll have to do some detective
work, but you can put a breakpoint on any function for which you have the
name, so run strings over the NTDLL pdb and figure out likely candidates for
the page fault handler. Then put a breakpoint on one of them, then see what
happens. If you missed on the first one, try another. If you find one that
works, look at the stack and see if you can put your breakpoint a bit lower
in the stack so you’re closer to the culprit.
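A rough stand-in for running strings over a symbol file: pull out runs of printable ASCII, as the Unix strings tool does, and keep the ones that look like page-fault handling. The handler named later in this thread, nt!KiTrap0E, lives in the kernel image rather than NTDLL, so the kernel’s symbol file is probably the more useful one to scan. The keyword list and function names are assumptions, and the path in the usage comment is hypothetical.

```python
import re
from typing import Iterable, List


def strings(data: bytes, min_len: int = 4) -> List[str]:
    """Runs of at least `min_len` printable ASCII bytes, like the Unix strings tool."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def candidate_symbols(data: bytes,
                      keywords: Iterable[str] = ("trap", "fault")) -> List[str]:
    """Unique strings containing any keyword (case-insensitive), first-seen order."""
    keys = [k.lower() for k in keywords]
    hits = (s for s in strings(data) if any(k in s.lower() for k in keys))
    return list(dict.fromkeys(hits))


blob = b"\x00KiTrap0E\x00\x01MmAccessFault\x00KeBugCheck\x00\xffabc\x00KiTrap0E"
print(candidate_symbols(blob))  # ['KiTrap0E', 'MmAccessFault']

# Usage on a real file (hypothetical path):
# with open("ntoskrnl.pdb", "rb") as f:
#     print(candidate_symbols(f.read()))
```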

Only put a breakpoint on one at a time; otherwise you’ll be overwhelmed,
holding down the F5 key just to get the target to do anything at all.

Hope this helps,

Phil

Philip D. Barila
Seagate Technology, LLC

Thanks.

For reference, it’s nt!KiTrap0E. Now for the joy of unwinding from
an interrupt frame into the previous user stack space. While I’ve
hacked in kernel space on x86 before, it’s never been on NT. Gotta
learn it some time, and now is as good as any I guess :-)
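For the unwinding step, it helps to know what the 32-bit x86 CPU itself leaves on the stack when it enters the page fault handler: the error code at ESP, then EIP, CS and EFLAGS, plus the interrupted ESP and SS when the fault came from a less privileged ring (a saved CS with a nonzero RPL, such as NT’s user-mode 0x1B). The faulting address is in CR2, not on the stack. A minimal sketch, assuming you have dumped the raw bytes starting at ESP on entry to nt!KiTrap0E; once KiTrap0E has built its own trap frame, WinDbg’s .trap command should set up this context for you. The names are hypothetical.

```python
import struct
from typing import NamedTuple, Optional


class HwFaultFrame(NamedTuple):
    error_code: int
    eip: int
    cs: int
    eflags: int
    user_esp: Optional[int]  # only pushed when the fault came from a less privileged ring
    user_ss: Optional[int]


def parse_hw_frame(stack: bytes) -> HwFaultFrame:
    """Decode the dwords a 32-bit x86 CPU pushes for a page fault (vector 0x0E),
    reading little-endian from the handler's ESP on entry."""
    error_code, eip, cs, eflags = struct.unpack_from("<4I", stack, 0)
    if cs & 3:
        # Saved CS has a nonzero RPL (user mode on NT), so the CPU also pushed
        # the interrupted ESP and SS; only the low 16 bits of SS are meaningful.
        user_esp, user_ss = struct.unpack_from("<2I", stack, 16)
        return HwFaultFrame(error_code, eip, cs, eflags, user_esp, user_ss & 0xFFFF)
    return HwFaultFrame(error_code, eip, cs, eflags, None, None)


# Hypothetical dump: user-mode write fault at EIP 0x77e81234.
raw = struct.pack("<6I", 0x6, 0x77E81234, 0x1B, 0x202, 0x0012F000, 0x23)
frame = parse_hw_frame(raw)
print(hex(frame.eip), hex(frame.user_esp))  # 0x77e81234 0x12f000
```

The saved user EIP and ESP are the starting point for walking back up the user-mode stack.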

Also, I have a feeling KiTrap0E might be used by “other things” as
well, say for garbage collection or memory mapping, so I don’t know
how tenacious I have to be in filtering the “right” exceptions. I
really wish VTune had a “page fault” event counter!
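On the filtering question, one cheap discriminator is the error code the CPU pushes with every x86 page fault (at the first instruction of the handler it should sit at the top of the stack): its low bits say whether the page was not present or merely protected, whether the access was a write, and whether it happened in user or kernel mode. Combined with the faulting address in CR2, that should make it possible to ignore faults from modes or address ranges you don’t care about. A minimal sketch; the bit meanings follow the Intel architecture manuals, while the filter policy and names are hypothetical.

```python
from typing import Dict

# Low bits of the x86 page-fault (#PF) error code.
PF_PROTECTION = 1 << 0  # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1       # 0: read, 1: write
PF_USER = 1 << 2        # 0: kernel mode, 1: user mode
PF_RSVD = 1 << 3        # reserved bit set in a paging-structure entry
PF_FETCH = 1 << 4       # instruction fetch (reported only with NX enabled)


def decode_pf_error(code: int) -> Dict[str, bool]:
    """Split a page-fault error code into named flags."""
    return {
        "protection": bool(code & PF_PROTECTION),
        "write": bool(code & PF_WRITE),
        "user": bool(code & PF_USER),
        "reserved_bit": bool(code & PF_RSVD),
        "fetch": bool(code & PF_FETCH),
    }


def is_interesting(code: int, want_user: bool = True) -> bool:
    """Hypothetical filter: keep not-present faults raised in the requested mode."""
    bits = decode_pf_error(code)
    return not bits["protection"] and bits["user"] == want_user


for code in (0x4, 0x6, 0x5, 0x2):
    print(hex(code), is_interesting(code))  # True, True, False, False
```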

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com]On Behalf Of Phil Barila
Sent: Friday, April 05, 2002 7:46 AM
To: Kernel Debugging Interest List
Subject: [windbg] Re: How to track down excessive VM page faults?

> [snip]