Problem with Special pool corruption

Hi All,
I am currently facing a problem in my volume filter driver. I get the SPECIAL
POOL CORRUPTION BSOD on one of my test machines. The same machine gives me
the BSOD every two or three days; in between it works fine, with no BSODs
even after several reboots. The machine is an AMD 64 machine with 32-bit
WinXP installed. I have set the verifier settings for special pool tracking.
The BSOD does not appear on any other machine, not even randomly.
It looks like my driver is corrupting memory by writing within the special
pool page but outside the allocated range, and the machine gives me the BSOD
when this memory is freed. So I set the verify_start flag using gflags.exe so
that I would get the BSOD at the time the invalid access is made. Even this
is not helping: I don’t get any BSOD after that.
Is there any way I can track down the code where this kind of invalid access
is happening? I used verify start but had no luck. !poolval gives me the
following output:

Thanks,
Giri.

!poolval 868e6f68
Pool page 868e6f68 region is Unknown

Validating Pool headers for pool page: 868e6f68

Pool page [868e6000] is __inVALID.

Analyzing linked list…
[868e6000]: invalid previous size [0x94] should be [0x0]
[868e6000 –> 868e6008 (size = 0x8 bytes)]: Corrupt region

Scanning for single bit errors…

None found

kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

SPECIAL_POOL_DETECTED_MEMORY_CORRUPTION (c1)
Special pool has detected memory corruption. Typically the current thread’s
stack backtrace will reveal the guilty party.
Arguments:
Arg1: 868e6f68, address trying to free
Arg2: 868e6131, address where bits are corrupted
Arg3: 00e10094, (reserved)
Arg4: 00000023, caller is freeing an address where nearby bytes within the
same page have been corrupted

Debugging Details:

The call to LoadLibrary(kext) failed, Win32 error 193
“%1 is not a valid Win32 application.”
Please check your debugger configuration and/or network access.
The call to LoadLibrary(kext) failed, Win32 error 193
“%1 is not a valid Win32 application.”
Please check your debugger configuration and/or network access.

BUGCHECK_STR: 0xC1_23

SPECIAL_POOL_CORRUPTION_TYPE: 23

CUSTOMER_CRASH_COUNT: 2

DEFAULT_BUCKET_ID: DRIVER_FAULT

LAST_CONTROL_TRANSFER: from 80660325 to 804f8925

STACK_TEXT:
ba745c04 80660325 000000c1 868e6f68 868e6131 nt!KeBugCheckEx+0x1b
ba745c50 80543a30 868e6f68 ba745cd3 85ce78e0 nt!MmFreeSpecialPool+0x2e3
ba745c90 80652690 868e6f68 00000000 8065282b nt!ExFreePoolWithTag+0x4a
ba745c9c 8065282b ffff0000 868e6f68 00000000 nt!VfIrpFree+0xc
ba745cb8 8064b4f1 00e49984 00000000 ba745d48 nt!VerifierIoFreeIrp+0x129
ba745cc8 8056bf6a 868e6f68 ba745d64 0110fec8 nt!IovFreeIrpPrivate+0x41
ba745d48 8053c808 00000190 0110ff00 0110fef0 nt!NtRemoveIoCompletion+0x12a
ba745d48 7c90eb94 00000190 0110ff00 0110fef0 nt!KiFastCallEntry+0xf8
WARNING: Frame IP not in any known module. Following frames may be wrong.
0110fed8 00000000 00000000 00000000 00000000 0x7c90eb94

STACK_COMMAND: kb

FOLLOWUP_IP:
nt!MmFreeSpecialPool+2e3
80660325 8b4708 mov eax,[edi+0x8]

FAULTING_SOURCE_CODE:

SYMBOL_STACK_INDEX: 1

FOLLOWUP_NAME: MachineOwner

SYMBOL_NAME: nt!MmFreeSpecialPool+2e3

MODULE_NAME: nt

DEBUG_FLR_IMAGE_TIMESTAMP: 42250a1d

IMAGE_NAME: memory_corruption

FAILURE_BUCKET_ID: 0xC1_23_nt!MmFreeSpecialPool+2e3

BUCKET_ID: 0xC1_23_nt!MmFreeSpecialPool+2e3

Followup: MachineOwner

kd> kb
ChildEBP RetAddr Args to Child
ba745c04 80660325 000000c1 868e6f68 868e6131 nt!KeBugCheckEx+0x1b
ba745c50 80543a30 868e6f68 ba745cd3 85ce78e0 nt!MmFreeSpecialPool+0x2e3
ba745c90 80652690 868e6f68 00000000 8065282b nt!ExFreePoolWithTag+0x4a
ba745c9c 8065282b ffff0000 868e6f68 00000000 nt!VfIrpFree+0xc
ba745cb8 8064b4f1 00e49984 00000000 ba745d48 nt!VerifierIoFreeIrp+0x129
ba745cc8 8056bf6a 868e6f68 ba745d64 0110fec8 nt!IovFreeIrpPrivate+0x41
ba745d48 8053c808 00000190 0110ff00 0110fef0 nt!NtRemoveIoCompletion+0x12a
ba745d48 7c90eb94 00000190 0110ff00 0110fef0 nt!KiFastCallEntry+0xf8
WARNING: Frame IP not in any known module. Following frames may be wrong.
0110fed8 00000000 00000000 00000000 00000000 0x7c90eb94
kd> !process
The call to LoadLibrary(kext) failed, Win32 error 193
“%1 is not a valid Win32 application.”
Please check your debugger configuration and/or network access.
GetPointerFromAddress: unable to read from 80557bb4
PROCESS 855976e8 SessionId: none Cid: 0398 Peb: 7ffde000 ParentCid:
02e4
DirBase: 17720060 ObjectTable: e1743710 HandleCount: Accessible>
Image: winlogon.exe
VadRoot 85e9d328 Vads 200 Clone 0 Private 1320. Modified 1860. Locked 0.
DeviceMap e1005440
Token e1968030
ReadMemory error: Cannot get nt!KeMaximumIncrement value.
ffdf0000: Unable to get shared data
ElapsedTime 00:00:00.000
UserTime 00:00:00.000
KernelTime 00:00:00.000
QuotaPoolUsage[PagedPool] 107108
QuotaPoolUsage[NonPagedPool] 518168
Working Set Sizes (now,min,max) (2067, 50, 345) (8268KB, 200KB, 1380KB)
PeakWorkingSetSize 3541
VirtualSize 61 Mb
PeakVirtualSize 66 Mb
PageFaultCount 8523
MemoryPriority BACKGROUND
BasePriority 13
CommitCharge 2039

Error in in reading nt!_ETHREAD @ 855836f0

kd> !process 0 0
The call to LoadLibrary(kext) failed, Win32 error 193
“%1 is not a valid Win32 application.”
Please check your debugger configuration and/or network access.
* NT ACTIVE PROCESS DUMP *
GetPointerFromAddress: unable to read from 80557bb4
Error in reading nt!_EPROCESS at 00000000

Special pool as you are using it here has a single allocation (an IRP) allocated at the very end of a memory page (868e6f68). All of the rest of the memory on that page is filled with a specific fill pattern (looks like 0x94, based upon the !poolval output). For extra safety, a non-present page follows, so you will bugcheck if you go off the end [you didn’t, in this case].

When the memory was freed, the unused part of the page was checked to see if it was untouched. This scan shows that the fill pattern was overwritten starting at 868e6131. So someone wrote to this address between the time the IRP was allocated and the time it was freed.
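To make the layout and the free-time check concrete, here is a minimal user-mode sketch. It is a simplified model, not the actual Mm implementation: the function names, the 8-byte alignment, and the 0x98-byte allocation size (which is what the addresses in the dump imply if the IRP runs to the end of the page) are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SP_PAGE_SIZE 4096u

/* Hypothetical helper: fill the whole page with the pattern and return the
 * offset at which an allocation of `size` bytes is placed so that it ends
 * at the page boundary (rounded down to an assumed 8-byte alignment). */
size_t sp_place_at_end(uint8_t *page, size_t size, uint8_t fill)
{
    assert(page != NULL && size > 0 && size <= SP_PAGE_SIZE);
    memset(page, fill, SP_PAGE_SIZE);
    return (SP_PAGE_SIZE - size) & ~(size_t)7;
}

/* Hypothetical free-time check: return the offset of the first byte in
 * front of the allocation that no longer holds the fill pattern, or -1 if
 * the unused part of the page is untouched. */
long sp_first_corrupt(const uint8_t *page, size_t alloc_offset, uint8_t fill)
{
    assert(page != NULL && alloc_offset <= SP_PAGE_SIZE);
    for (size_t i = 0; i < alloc_offset; i++) {
        if (page[i] != fill)
            return (long)i;
    }
    return -1;
}
```

With a 0x98-byte allocation and fill 0x94, the allocation lands at page offset 0xf68. A stray write at offset 0x131 goes unnoticed until the scan at free time reports it, which matches the shape of Arg1 and Arg2 in the bugcheck.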

These can be incredibly hard to track down. Some suggestions:

(1) First of all, look at the corrupted section- does it resemble any of your structures? Does it look like pointer addresses there? If so, are they valid [sometimes the pool tags for the pointers can tell you who the corruptor is].

(2) Use !vtop if you can to get the physical address for this page. Is this address present in one of your device’s DMA registers? If so, your device may have executed a transaction that should not have happened.

Given some of what I’m reading- the second trail is probably your best bet. If you’ve got a DMA Device writing to physical addresses it shouldn’t, there’s not much verifier can do to bugcheck at the time you do it- the processor is out of the loop when the corruption occurs.

Sometimes if I think I’m at fault, I add a table to my driver showing the last N physical buffers I used, so I can compare it to the place the corruption was detected when the bugcheck finally occurred.
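A table like that can be as simple as a fixed-size ring buffer. The sketch below is user-mode C with hypothetical names (dma_log, dma_log_record, dma_log_find) and an arbitrary depth of 64. In a real driver the table would have to live in nonpaged memory and be updated under whatever synchronization the driver already uses, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_LOG_ENTRIES 64u   /* "N": assumed history depth */

typedef struct {
    uint64_t phys_addr;       /* start of the buffer handed to the device */
    uint32_t length;          /* length of that buffer in bytes */
} dma_log_entry;

typedef struct {
    dma_log_entry entries[DMA_LOG_ENTRIES];
    uint32_t next;            /* slot the next record will overwrite */
    uint32_t count;           /* valid entries, saturates at DMA_LOG_ENTRIES */
} dma_log;

/* Remember one physical buffer, overwriting the oldest entry when full. */
void dma_log_record(dma_log *log, uint64_t phys, uint32_t len)
{
    assert(log != NULL);
    log->entries[log->next].phys_addr = phys;
    log->entries[log->next].length = len;
    log->next = (log->next + 1u) % DMA_LOG_ENTRIES;
    if (log->count < DMA_LOG_ENTRIES)
        log->count++;
}

/* Return the index of a logged buffer that covers `phys`, or -1 if none of
 * the remembered buffers contains that physical address. */
int dma_log_find(const dma_log *log, uint64_t phys)
{
    assert(log != NULL);
    for (uint32_t i = 0; i < log->count; i++) {
        const dma_log_entry *e = &log->entries[i];
        if (phys >= e->phys_addr && phys - e->phys_addr < e->length)
            return (int)i;
    }
    return -1;
}
```

After a bugcheck, the physical address of the corrupted byte can be compared against the table (from the debugger or with a lookup like dma_log_find) to see whether any recent transfer covered it.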

Also, (IMO) it’s not too likely that anything done with the IRP is the cause- the corruption is pretty far from the IRP itself- almost at the start of the page.

The verifier start option you mention (IIRC) reverses the way the page is laid out- your object is at the start of the page, the remainder of the page is filled with a pattern, and the preceding page is not present.

These options will catch you in the act of overrunning or underrunning a buffer, but there are other corruptors they can’t catch at the time they occur [even corruption via virtual address- stale list pointers, for instance].

FWIW, you may have some other corruption problems given some of the other messages [or this may not be a full dump].

Hi Robert,
I was thinking that with verify_start the previous page is also marked as
inaccessible. Is that right?

This is what my driver is doing:
I capture the write request IRP at the volume filter level. In the
IRP_MJ_WRITE handler I get the current I/O stack location
[IoGetCurrentIrpStackLocation(pIrp);] and retrieve the WriteOffset and
WriteLength. Once this information is retrieved I call
IoSkipCurrentIrpStackLocation(pIrp); and simply forward the IRP to the
lower attached driver.

I have rechecked all my code and I don’t see anything that writes to the
memory allocated for the IRP. In the dump, only one byte differs from the
fill pattern, so that is of no use.
I also tried !vtop but the debugger gives the following error:

kd> !vtop 868e6f68
The call to LoadLibrary(kext) failed, Win32 error 193
“%1 is not a valid Win32 application.”
Please check your debugger configuration and/or network access.
Pdi 0 Pti 0
TargetInfo::ReadPhysical is not available in the current debug session
physical read at 868e6000 failed

Thanks,
Girish.

On 9/21/06, xxxxx@microsoft.com wrote:
>
> Special pool as you are using it here has a single allocation (an IRP)
> allocated at the very end of a memory page (868e6f68). All of the rest of
> the memory on that page is filled with a specific fill pattern (looks like
> 0x94, based upon the !poolval output). For extra safety, a non-present page
> follows, so you will bugcheck if you go off the end [you didn’t, in this
> case].

>>Hi Robert,
I was thinking that with verify_start the previous page is also marked as inaccessible. Is that right?<<
Yes- the allocation is at the beginning of a page, the previous page is not present, and the rest of the allocation page is filled with a pattern. By default, all of that is reversed.

>>This is what my driver is doing:
I capture the write request IRP at the volume filter level.
In the IRP_MJ_WRITE handler I get the current I/O stack location [IoGetCurrentIrpStackLocation(pIrp);] and retrieve the WriteOffset and WriteLength. Once this information is retrieved I call IoSkipCurrentIrpStackLocation(pIrp); and simply forward the IRP to the lower attached driver.<<

Sorry, I missed that you had a filter here. It’s very unlikely you corrupted physical memory in that case, so much of what I then babbled on about doesn’t apply to you.

>>I have rechecked all my code and I don’t see anything that writes to the memory allocated for the IRP. In the dump, only one byte differs from the fill pattern, so that is of no use.<<

But it is the only clue available at the moment. I doubt the corrupting code actually has anything to do with the IRP. For instance, if you use an uninitialized pointer anywhere in your code, you can write to addresses formed from random stack data. So that one byte *may* give some hints, e.g.:

* Is the one-byte change something you might have done by ANDing or ORing (or some other logical/arithmetic operation) flag values well-known to you? In this case, it’s the difference that matters, not the actual value. If you only do this sort of operation in a few places, then that narrows the scope of a “search by code review”.
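
One way to look at "the difference, not the value" is to compare the fill byte with the byte found in the dump bit by bit. The sketch below is plain C with hypothetical helper names. The thread never shows the actual corrupted byte, so any observed values you feed it are yours to supply from the dump.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    CHANGE_NONE,          /* byte still equals the fill pattern */
    CHANGE_BITS_SET,      /* only 0 -> 1 transitions: consistent with an OR */
    CHANGE_BITS_CLEARED,  /* only 1 -> 0 transitions: consistent with an AND */
    CHANGE_MIXED          /* both kinds: more likely a plain store or arithmetic */
} change_kind;

/* Bits that differ between the fill pattern and the observed byte. */
uint8_t changed_bits(uint8_t fill, uint8_t observed)
{
    return (uint8_t)(fill ^ observed);
}

/* Classify the change by which direction the bits moved. */
change_kind classify_change(uint8_t fill, uint8_t observed)
{
    uint8_t set     = (uint8_t)(observed & ~fill);   /* bits that went 0 -> 1 */
    uint8_t cleared = (uint8_t)(fill & ~observed);   /* bits that went 1 -> 0 */

    if (!set && !cleared) return CHANGE_NONE;
    if (set && !cleared)  return CHANGE_BITS_SET;
    if (!set && cleared)  return CHANGE_BITS_CLEARED;
    return CHANGE_MIXED;
}
```

If the changed bits line up with a flag mask your driver sets or clears in only a few places, those places are the first candidates for review. A mixed change says less, since almost any store could produce it.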

If nothing seems likely there, or that’s just not an idea you want to pursue, another thing I would recommend [if you haven’t already done so] is using PREfast or Static Driver Verifier (both in the RC1 WDK). Both can point out places where you may have potential problems of this sort. PREfast is usually the easier place to start, while SDV can get really in-depth. I know PREfast has saved me from having to debug this sort of thing in my own code more than once. Enough so that I try harder to annotate things properly to make it all work better for me.

Others on this list have highly recommended PC-lint, and I know there are other code analysis tools out there besides these. Even a cheap step like building with /W4 /WX and really looking at the errors may solve this one for you.

>>I also tried !vtop but the debugger gives the following error:

kd> !vtop 868e6f68
The call to LoadLibrary(kext) failed, Win32 error 193
“%1 is not a valid Win32 application.”
Please check your debugger configuration and/or network access.
Pdi 0 Pti 0
TargetInfo::ReadPhysical is not available in the current debug session
physical read at 868e6000 failed<<

You need to give the page directory base for the current process as one of the parameters [identified as DirBase in the !process output you had in the OP]. IIRC, that value has to be converted into a PFN (drop the bottom 12 bits, i.e. the last three hex digits). The !vtop documentation should discuss this.
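
The arithmetic involved is small enough to sketch. This assumes 4 KB pages, so the page frame number is the address shifted right by 12 bits; the helper names are hypothetical. Note that the DirBase in the dump (17720060) has nonzero low bits, which a plain shift simply discards. Whether that matters for !vtop on this particular system is not something the output above settles.

```c
#include <assert.h>
#include <stdint.h>

#define SP_PAGE_SHIFT 12u   /* assumed 4 KB pages */

/* Page frame number for a physical address such as DirBase. */
uint32_t dirbase_to_pfn(uint32_t dirbase)
{
    return dirbase >> SP_PAGE_SHIFT;
}

/* Start of the page containing a virtual address. */
uint32_t page_base(uint32_t va)
{
    return va & ~((1u << SP_PAGE_SHIFT) - 1u);
}
```

For the values in this thread, dirbase_to_pfn(0x17720060) gives 0x17720, and page_base(0x868e6f68) gives 0x868e6000, the page that !poolval reported on.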

Also some of the error messages showing up in your output lead me to suspect that you may have some other debugger-related problems [either side effects of corruption, bad symbols, partial dump- hard to tell from where I sit which, if any, it might be].

More to the point here, since you’re a filter driver, you’re not programming any physical addresses, so you don’t really need to follow up on this aspect.

As I said, these are really hard to track down. Good luck. There’s always the possibility someone else here has better advice, too!