tracking down an MDL leak

Dave_Vitek · February 3, 2010, 12:45pm

Hi all,

I was wondering if anyone could offer some advice on tracking down the
party causing an MDL leak on a production system – one that eventually
exhausts the non-paged pool. This is on an up-to-date x86 Windows XP
system.

Here are the first few lines of poolused:

lkd> !poolused 3
Sorting by NonPaged Pool Consumed

Pool Used:
                     NonPaged                              Paged
Tag      Allocs      Frees     Diff       Used   Allocs    Frees  Diff   Used
Mdl     6427733    4840022  1587711  203227008        0        0     0      0  Io, Mdls
MmCm        566         15      551   10871056        0        0     0      0  Calls made to MmAllocateContiguousMemory, Binary: nt!mm
TCPt     411941     411909       32     504704        1        1     0      0  TCP/IP network protocol, Binary: TCP
Even  274011252  274002062     9190     443200        0        0     0      0  Event objects
FSrm     800111     800079       32     397424   581047   581022    25  52856  File System Run Time, Binary: nt!fsrtl

The machine takes a couple weeks to get into this state. We do have a
custom file system filter driver that runs on this machine. However, I
believe all its allocations are through ExAllocatePoolWithTag with its
own tag. The driver is installed, started, stopped, and uninstalled
periodically as the machine runs (the machine is a buildbot that does
continuous integration – basically building sandboxes over and over).

I set up verifier.exe to track pool use for all drivers on the system.
verifier.exe reports that our custom driver peaks at 112 bytes in the
non-paged pool (1 allocation). Right now the machine has been up for a
few days. Yesterday, total NP Pool was at 80M. Today it is past 110M.

The sum of the “NP Pool - bytes allocated” entries in verifier.exe
doesn’t come anywhere near accounting for the 110M. The biggest
consumer is ntfs.sys with 10M, and it doesn’t seem to be changing.

The sum of the NP Pool entries in taskman comes to about 1M. The total
number of handles isn’t trending upwards (sits at about 8500).

Is there another type of instrumentation I could be checking? Should I
be trying to find and dissect the MDLs themselves with a debugger
(resources or examples are appreciated)?

Igor_Sharovar · February 3, 2010, 12:55pm

WinDbg's !memusage command may give you additional information.

Igor Sharovar

Scott_Noone_OSR · February 3, 2010, 1:41pm

Can you post the full !poolused output?

Do you build any IRPs in your driver and send them down to the lower
drivers? That might provoke the underlying layers to allocate MDLs that you
are potentially in charge of freeing.

-scott

–
Scott Noone
Consulting Associate
OSR Open Systems Resources, Inc.
http://www.osronline.com

Dave_Vitek · February 5, 2010, 12:26pm

Thanks for the ideas guys. I disabled the verifier (windbg was not happy when it was enabled) and am waiting for the machine to get back into the bad state, so it may be a week or two before I can post the full !poolused + !vm + partial !memusage output. The machine hasn’t really increased its NP pool use since it rebooted yesterday (low workload perhaps).

I am assuming !memusage rows like:

Control  Valid Standby Dirty Shared Locked PageTables  name
86a66290     0       4     0      0      0          0  mapped_file( vsnprintf.c )

are never to blame, since all the used space is under “Standby”, which I am guessing equates to the FS cache?

The author of our home-grown driver tells me “We only pass reparse back UP the stack.” I’ll probably take our custom driver out of the mix the next time the machine cycles, to rule it out.

Maxim_S_Shatskih · February 5, 2010, 2:32pm

> Thanks for the ideas guys. I disabled the verifier (windbg was not happy when it was enabled) and
> am waiting for the machine to get back into the bad state, so it may be a week or two before I can

You can wrap the PMDL in your own structure, and use that structure's allocator together with IoAllocateMdl instead of calling IoAllocateMdl directly.

Then monitor leaks of that structure as well.

–
Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Dave_Vitek · February 9, 2010, 1:36pm

Hi All,

I have posted various WinDbg output here:

http://www.grammatech.com/vitek/memusage.txt

I removed the mapped_file file names to avoid breaking any NDAs. I can look up specific ones if requested.

I don’t know if this is useful, but here it is. The first thing that breaks from the failing allocations usually seems to be SMB access. The workload involves quite a bit of SMB activity (reads and writes on a file system mounted with ‘net use’).

Maxim:

We don’t actually allocate any MDLs in our driver (at least not directly). It is a pretty small thing. I would need a way of monitoring MDLs allocated by third-party and OS code.

(See http://www.osronline.com/showThread.CFM?link=175406 for thread history)

Pavel_Lebedinsky · February 11, 2010, 12:52am

> We don’t actually allocate any MDLs in our driver (at least not directly).
> It is a pretty small thing. I would need a way of monitoring MDLs
> allocated by third-party and OS code.

On XP I think the only way to do that would be to connect a kernel
debugger and set breakpoints on IoAllocateMdl/IoFreeMdl. The output
of !poolused shows that roughly 1 out of 4 Mdl allocations is leaked,
so there is a good chance you might be able to catch the leaking driver
this way.
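
The "1 out of 4" figure can be re-derived from the Mdl row of the !poolused output posted earlier. The short Python sketch below only redoes that arithmetic; the variable names are mine, and the result is an upper bound on the leak rate, since not every outstanding allocation is necessarily leaked.

```python
# Figures copied from the Mdl row of the !poolused output in the first post.
allocs = 6_427_733        # NonPaged Allocs
frees = 4_840_022         # NonPaged Frees
used_bytes = 203_227_008  # NonPaged Used

outstanding = allocs - frees            # matches the reported Diff column
leak_fraction = outstanding / allocs    # share of allocations never freed
bytes_per_block = used_bytes / outstanding

print(f"outstanding Mdl allocations: {outstanding}")
print(f"fraction not freed: {leak_fraction:.1%}")
print(f"average pool bytes per outstanding allocation: {bytes_per_block}")
```

So about a quarter of all Mdl allocations since boot are still outstanding, and on average each one accounts for 128 bytes of non-paged pool.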

You could also use !poolfind and dump a few MDLs at random to
see if anything stands out (e.g. they might all have the same EPROCESS
pointer, or an unusual combination of flags, etc).

–
Pavel Lebedinsky/Windows Fundamentals Test
This posting is provided “AS IS” with no warranties, and confers no rights.

Scott_Noone_OSR · February 11, 2010, 6:58am

Another issue with tracking down MDL leaks is that IoAllocateMdl/IoFreeMdl
use a lookaside list for MDLs under a certain size, so when you go picking
through outstanding MDL allocations you may find ones that aren’t actually
leaked.

I tracked down an issue similar to this not that long ago, though in that
case the leak was much faster which always makes things easier. In order to
track down the cause of the leak I patched IoAllocateMdl/IoFreeMdl via the
debugger early in the boot process to always allocate the MDL structure out
of pool. Then using PoolTag I could see exactly what operations caused a
leak in the MDL count and could capture a trace of the allocs/frees during
those operations using a couple of simple breakpoints:

bp nt!ioallocatemdl "k;g"
bp nt!iofreemdl "k;g"

Not sure if this will work for you in this situation, but it did end up
working out nicely for me.
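
Once such a trace exists, finding the leak comes down to pairing each allocation with its free. The Python sketch below illustrates that step under a loud assumption: the raw debugger log has already been reduced to lines of the form "ALLOC <address> <caller>" and "FREE <address>". That format, the function name, and the driverA/driverB callers are all hypothetical; the breakpoints above print only stacks, so the MDL address would have to be captured separately.

```python
from collections import Counter


def outstanding_by_caller(lines):
    """Count never-freed allocations per caller.

    Each line is assumed to be 'ALLOC <address> <caller>' or
    'FREE <address>' (a hypothetical, pre-processed format).
    """
    live = {}  # address -> caller that currently owns the block
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "ALLOC":
            live[parts[1]] = parts[2]
        elif len(parts) >= 2 and parts[0] == "FREE":
            # Frees of blocks allocated before tracing began are ignored.
            live.pop(parts[1], None)
    return Counter(live.values())


# Made-up sample trace; the addresses and callers are illustrative only.
sample = [
    "ALLOC 86a66290 driverA+0x1234",
    "ALLOC 86a66400 driverB+0x0040",
    "FREE 86a66290",
    "ALLOC 86a66290 driverB+0x0040",  # address reused after the free
    "ALLOC 86a66500 driverA+0x1234",
    "FREE 86a66500",
]

for caller, count in outstanding_by_caller(sample).most_common():
    print(caller, count)
```

In this sample every driverA allocation is freed, while both driverB allocations remain outstanding, so driverB would be the caller to look at first.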

-scott


anton_bassov · February 11, 2010, 11:39am

> I was wondering if anyone could offer some advice on tracking down the party causing an
> MDL leak on a production system – eventually exhausting the non-paged pool.

I really have no idea what makes you so certain that this problem is related to MDLs (or, in fact, to ANY direct NP allocation). Judging from the pace at which the leak accumulates, direct NP allocations are unlikely to be the source of the problem here. If you were simply failing to free NP memory, you would presumably run out of it much faster, because NP allocations by drivers tend to be short-lived and frequent (and, in the case of MDLs, quite large as well).

However, in your particular case it takes two weeks to run out of NP memory, which strongly suggests that the source of the leak is more subtle than a simple failure to free memory. The first candidate that comes to mind under these circumstances is the Object Manager: if you reference any kernel object that is allocated from non-paged pool (FILE_OBJECT, ETHREAD, EPROCESS, etc.) and fail to dereference it, deleting it will have no effect until the refcount drops to zero. As a result, you get an NP memory leak. Since kernel objects tend to have relatively long lifespans, the resulting leak does not become noticeable right away…

> The total number of handles isn’t trending upwards (sits at about 8500).

As long as the total refcount is nonzero, closing all handles to an object does not lead to its destruction, does it? Therefore, if you reference an object with, say, ObReferenceObject() and fail to dereference it, the leak will not show up as a handle leak, although from a logical standpoint the two leak types are equivalent…
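
To make the distinction concrete, here is a toy Python model of an object with separate handle and reference counts. It is not how the Object Manager is implemented; the class and method names are made up, and it assumes each open handle also holds one reference.

```python
class KernelObject:
    """Toy model of an object with a handle count and a reference count."""

    def __init__(self):
        # Assumption: the object is created with one open handle,
        # and that handle holds one reference.
        self.handle_count = 1
        self.pointer_count = 1
        self.freed = False

    def reference(self):
        """Stands in for a call such as ObReferenceObject()."""
        self.pointer_count += 1

    def dereference(self):
        """Stands in for the matching dereference call."""
        self.pointer_count -= 1
        if self.pointer_count == 0:
            self.freed = True  # pool memory would be released here

    def close_handle(self):
        """Closing a handle drops the reference that handle held."""
        self.handle_count -= 1
        self.dereference()


obj = KernelObject()
obj.reference()      # a driver takes an extra reference...
obj.close_handle()   # ...and the last handle is then closed
print(obj.handle_count, obj.pointer_count, obj.freed)

obj.dereference()    # only the missing dereference frees the object
print(obj.freed)
```

After the handle is closed the handle count is zero, yet the object stays allocated until the extra reference is dropped, which is why a flat handle count does not rule out this kind of leak.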

Anton Bassov