!verifier "PoolAddress" -> !poolval <address>?

Mark_McDougall · May 11, 2007, 1:33am

Is the “PoolAddress” shown in the “!verifier 3” output the value I should
pass to the “!poolval” command?

i.e. Is “PoolAddress” the address of the pool headers or the pool memory page?

Regards,

--
Mark McDougall, Engineer
Virtual Logic Pty Ltd
21-25 King St, Rockdale, 2216
Ph: +612-9599-3255 Fax: +612-9599-3266

Mark_McDougall · May 11, 2007, 3:26am

Mark McDougall wrote:

> Is the “PoolAddress” shown in the “!verifier 3” output the value I
> should pass to the “!poolval” command?
>
> i.e. Is “PoolAddress” the address of the pool headers or the pool memory
> page?

I have a breakpoint in my code immediately after a page is allocated from
the nonpaged pool. It breaks the first time it is run after a reboot.

Below is the output… how can it possibly be corrupt at this point?
I must be doing something wrong with the Pool Address???

Regards,

—8<------8<------8<------8<------8<------8<------8<------8<—

kd> !verifier 3

Verify Level b … enabled options are:
Special pool
Special irql
All pool allocations checked on unload

Summary of All Verifier Statistics

RaiseIrqls 0x3
AcquireSpinLocks 0x0
Synch Executions 0x0
Trims 0x3

Pool Allocations Attempted 0x1
Pool Allocations Succeeded 0x1
Pool Allocations Succeeded SpecialPool 0x1
Pool Allocations With NO TAG 0x0
Pool Allocations Failed 0x0
Resource Allocations Failed Deliberately 0x0

Current paged pool allocations 0x0 for 00000000 bytes
Peak paged pool allocations 0x0 for 00000000 bytes
Current nonpaged pool allocations 0x1 for 00002000 bytes
Peak nonpaged pool allocations 0x1 for 00002000 bytes

Driver Verification List

Entry State NonPagedPool PagedPool Module

82aedf08 Loaded 00002000 00000000 mk7iser.sys

Current Pool Allocations 00000000 00000001
Current Pool Bytes 00000000 00002000
Peak Pool Allocations 00000000 00000001
Peak Pool Bytes 00000000 00002000

PoolAddress SizeInBytes Tag CallersAddress
8268c000 0x00002000 COMX f6acc293

82aede88 Loaded 00000000 00000000 mk7ibus.sys

Current Pool Allocations 00000000 00000000
Current Pool Bytes 00000000 00000000
Peak Pool Allocations 00000000 00000000
Peak Pool Bytes 00000000 00000000

kd> !poolval 8268c000
Pool page 8268c000 region is Nonpaged pool

Validating Pool headers for pool page: 8268c000

Pool page [8268c000] is __inVALID.

Analyzing linked list…

Scanning for single bit errors…

None found


OSR_Community_User · May 11, 2007, 8:44am

!poolval assumes you are looking at a “small pool” page: one in which a page of memory is suballocated, with the associated pool headers, tags, etc.

The pool address you are looking at in this case is a “large pool” allocation: you allocated more in a single chunk than a small pool allocation can hold (in this case the !verifier extension says you allocated 8K). These are allocated in multi-page chunks, and their address is always page-aligned (small pool allocations are never page-aligned, because the page begins with a header).

So the extension says it is “invalid” because it is not a small pool allocation.
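
The alignment argument can be checked by hand against the numbers in the log. The following is a minimal sketch only, assuming 4 KB pages on the 32-bit x86 target shown above; `PAGE_SIZE_X86`, `IsPageAligned` and `PagesSpanned` are illustrative names, not debugger or WDK APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KB pages, as on a 32-bit x86 target. */
#define PAGE_SIZE_X86 0x1000u

/* Returns 1 when the address sits exactly on a page boundary. */
int IsPageAligned(uint32_t address)
{
    return (address & (PAGE_SIZE_X86 - 1u)) == 0u;
}

/* Number of whole pages needed to hold `bytes` bytes. */
unsigned PagesSpanned(uint32_t bytes)
{
    return (bytes + PAGE_SIZE_X86 - 1u) / PAGE_SIZE_X86;
}
```

With the values reported by !verifier 3, `IsPageAligned(0x8268c000)` is true and `PagesSpanned(0x2000)` is 2, which matches the description of a multi-page, page-aligned allocation.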

!verifier 3 gives the starting addresses of the allocated memory, not the addresses of any pool header.

What is it you are trying to do?

Mark_McDougall · May 13, 2007, 8:23pm

Bob Kjelgaard wrote:

> So the extension says it is “invalid” because it is not a small pool
> allocation.

Ah, OK, thanks, that would explain it!

> !verifier 3 gives the starting addresses of the allocated memory, not
> the addresses of any pool header.

OK, thanks again.

> What is it you are trying to do?

Merely trying to understand (and catch) the source of what appears to be a
pool corruption by my driver. It is very difficult because verifier hasn’t
caught it, even with special pool enabled.

All I know is that sometime after my driver has been running, another
random driver in the system will crash with some type of pool corruption
problem (a couple of different ones).

Interestingly, if I prevent my driver from freeing allocated pool pages, I
am yet to see a crash. That suggests to me that it is indeed somehow
corrupting the pages it is using… or freeing pages it isn’t using???

Regards,


OSR_Community_User · May 14, 2007, 1:16pm

> > What is it you are trying to do?
>
> Merely trying to understand (and catch) the source of what appears to be a
> pool corruption by my driver. It is very difficult because verifier hasn’t
> caught it, even with special pool enabled.
>
> All I know is that sometime after my driver has been running, another
> random driver in the system will crash with some type of pool corruption
> problem (a couple of different ones).
>
> Interestingly, if I prevent my driver from freeing allocated pool pages, I
> am yet to see a crash. That suggests to me that it is indeed somehow
> corrupting the pages it is using… or freeing pages it isn’t using???

Thanks, Mark (I’m thinking you actually said most of that to begin with, so it was probably not a very good question on my part).

Two other tools that *might* help are PREfast for Drivers, and Static Driver Verifier in the Vista WDK. They’ll work for WDM drivers (SDV requires “C” only). Both work directly on your build machine.

The comment about it not occurring if you don’t free anything did remind me of something, though:

Do you walk any list structures (e.g. using LIST_ENTRY) in some of these pool allocations? Is there any chance you might have cases where an entry points into a freed allocation? If the allocations aren’t freed, then you never corrupt anything, because nobody else ever gets those addresses and puts something other than your list entries in them.

In the bug I remember, the code was something like

LIST_ENTRY MyEntry = MyStruct->ListEntry;

AcquireLockOnTheList();

Remove(MyEntry);

I suspect the original author meant to use the address of the entry, but what they did was snap the current entry to the stack and then acquire the lock. By the time you acquire it, someone else may have changed the entries before or after yours in the list, and that snapshot is worthless. The driver started “adjusting” pointers in memory now owned by someone else, in locations that often weren’t even pointers. Add further problems from the resulting broken links, and you get pool corruption, stack corruption, etc…
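
The failure mode described above can be replayed deterministically in user mode. This is a sketch only: `ListEntry`, `InitList`, `InsertTail` and `RemoveEntry` are minimal stand-ins written for this example (the real `LIST_ENTRY` helpers live in the WDK headers), the lock is reduced to comments, and the "other thread" is simulated by simply running its removal between the snapshot and the unlink.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a doubly linked list entry (hypothetical names). */
typedef struct ListEntry {
    struct ListEntry *Flink;
    struct ListEntry *Blink;
} ListEntry;

void InitList(ListEntry *head) { head->Flink = head->Blink = head; }

void InsertTail(ListEntry *head, ListEntry *e)
{
    e->Flink = head;
    e->Blink = head->Blink;
    head->Blink->Flink = e;
    head->Blink = e;
}

/* Unlinks using whatever neighbour pointers *e holds at this moment. */
void RemoveEntry(ListEntry *e)
{
    e->Blink->Flink = e->Flink;
    e->Flink->Blink = e->Blink;
}

/* Returns 1 if every forward link has a matching back link. */
int ListIsConsistent(const ListEntry *head)
{
    const ListEntry *e = head;
    do {
        if (e->Flink->Blink != e) return 0;
        e = e->Flink;
    } while (e != head);
    return 1;
}

/* Buggy shape: copy the entry first, then "take the lock", then unlink the copy. */
int DemoSnapshotRemoval(void)
{
    ListEntry head, a, b, c;
    InitList(&head);
    InsertTail(&head, &a);
    InsertTail(&head, &b);
    InsertTail(&head, &c);

    ListEntry snapshot = b;   /* stale copy taken before the lock */
    RemoveEntry(&a);          /* someone else unlinks (and would free) a */
    /* lock acquired here */
    RemoveEntry(&snapshot);   /* writes into a, leaves c pointing back at a */
    return ListIsConsistent(&head);
}

/* Intended shape: operate on the entry's address while holding the lock. */
int DemoAddressRemoval(void)
{
    ListEntry head, a, b, c;
    InitList(&head);
    InsertTail(&head, &a);
    InsertTail(&head, &b);
    InsertTail(&head, &c);

    RemoveEntry(&a);          /* someone else unlinks a */
    /* lock acquired here */
    RemoveEntry(&b);          /* uses b's current neighbours */
    return ListIsConsistent(&head) && head.Flink == &c;
}
```

In the snapshot variant the unlink stores through the stale pointer to `a`, which in a real driver could already be freed and reused, and `b` is never actually removed from the list.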

You could generalize that back to “do you have pointers from one item to another that might not get updated properly before you free whatever they’re pointing at?”.

Sometimes one can figure out cheap tracking code for problems like that (signature checking to validate pointers, for instance)…

Mark_Roddy · May 14, 2007, 1:46pm

“signature checking to validate pointers” works great if used consistently.
Have a valid and an invalid signature, and check every (for some value of
“every”) access to allocated regions for invalid or corrupt signatures. The
truly paranoid frame the data with signatures on either end. A single ULONG
value (like a pooltag) is sufficient and keeps the overhead down.
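
One possible shape for such framing is sketched below in portable user-mode C. Everything here is an assumption made for illustration: the signature values, the `GuardedBuffer` layout and the `Guarded*` function names are hypothetical, and `malloc` stands in for a pool allocation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical marker values; any two distinct 32-bit constants would do. */
#define SIG_VALID   0x58474953u
#define SIG_INVALID 0x44414544u

typedef struct GuardedBuffer {
    uint32_t      HeadSig;  /* signature in front of the data */
    size_t        Size;     /* usable bytes in Data */
    unsigned char Data[];   /* a 32-bit tail signature follows Data[Size - 1] */
} GuardedBuffer;

/* Allocates a buffer framed by a live signature at both ends. */
GuardedBuffer *GuardedAlloc(size_t size)
{
    uint32_t tail = SIG_VALID;
    GuardedBuffer *b = malloc(sizeof(*b) + size + sizeof(tail));
    if (b == NULL) return NULL;
    b->HeadSig = SIG_VALID;
    b->Size = size;
    memcpy(b->Data + size, &tail, sizeof(tail));  /* memcpy avoids alignment issues */
    return b;
}

/* Returns 1 only when both signatures still read as live. */
int GuardedCheck(const GuardedBuffer *b)
{
    uint32_t tail;
    if (b->HeadSig != SIG_VALID) return 0;
    memcpy(&tail, b->Data + b->Size, sizeof(tail));
    return tail == SIG_VALID;
}

/* Marks the buffer as dead; releasing the memory is left to the caller. */
void GuardedRetire(GuardedBuffer *b)
{
    uint32_t tail = SIG_INVALID;
    b->HeadSig = SIG_INVALID;
    memcpy(b->Data + b->Size, &tail, sizeof(tail));
}
```

Calling `GuardedCheck` at the top of every routine that touches the buffer turns a silent overrun or a use after retirement into an immediate, local failure rather than a crash in some unrelated driver.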


OSR_Community_User · May 14, 2007, 3:29pm

I had a similar reaction to Bob when I read the part about not seeing
the problem until freeing memory. I haven’t really followed this thread
much, so this may not be all that applicable, but I thought I would
throw it in for good measure, as it is more or less a variation on the
one he has already mentioned, involving levels of indirection.
Basically, one declares what one thinks is a pointer to an object, but
it is actually a pointer to a pointer to an object.

Here’s an example that was posted in the past couple of weeks on this
list:

typedef struct
{
    PSP_DEVICE_INTERFACE_DETAIL_DATA deviceDetailData;
    HANDLE deviceInterfaceHandle;
} *DeviceData;

deviceList = (DeviceData *) malloc(index * sizeof(DeviceData));

In this case, the author had copied the code from a WinUSB sample and
messed up the struct: DeviceData was supposed to be the tag, and the
typedef was supposed to be * DeviceList. The malloc() is actually
correct as written, but you see where this is going. While this is a
less than glamorous error, the magic of casting makes it quite possible
to do this inadvertently, particularly with unfamiliar code. The
whole thing will compile, link, load and possibly run without error
until you free memory, because somehow enough memory gets allocated
overall, even though the details of the allocations are incorrect.
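
To make the difference concrete, the sketch below compares how many bytes each reading of the name asks for. It is an illustration under assumptions: `DetailDataPtr` and `HandleLike` are plain `void *` stand-ins for the Windows types so the code builds anywhere, and `DeviceDataAsPosted`, `BytesAsPosted` and `BytesAsIntended` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the Windows pointer types (assumption: both are pointer-sized). */
typedef void *DetailDataPtr;
typedef void *HandleLike;

/* As posted: the typedef name is a POINTER to an anonymous struct. */
typedef struct {
    DetailDataPtr deviceDetailData;
    HandleLike    deviceInterfaceHandle;
} *DeviceDataAsPosted;

/* As described in the post: DeviceData is the tag, DeviceList the pointer typedef. */
typedef struct DeviceData {
    DetailDataPtr deviceDetailData;
    HandleLike    deviceInterfaceHandle;
} *DeviceList;

/* Bytes requested for `count` elements when the name means "pointer". */
size_t BytesAsPosted(size_t count)
{
    return count * sizeof(DeviceDataAsPosted);
}

/* Bytes requested for `count` elements when the name means the whole struct. */
size_t BytesAsIntended(size_t count)
{
    return count * sizeof(struct DeviceData);
}
```

Any code that later treats each array element as a full struct rather than as a pointer would touch more bytes than the first form allocates, and that kind of damage tends to surface only when the heap walks or frees the neighbouring blocks.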

Best of luck,

mm

The scenario I have in mind involves


Mark_McDougall · May 15, 2007, 12:29am

Bob Kjelgaard wrote:

> Two other tools that *might* help are PreFast For Drivers, and Static
> Driver Verifier in the Vista WDK. They’ll work for WDM drivers (SDV
> requires “C” only). Both work directly on your build machine.

Thanks Bob, I’ll look into them…

> You could generalize that back to “do you have pointers from one item
> to another that might not get updated properly before you free whatever
> they’re pointing at?”.

“My” driver is a slightly modified version of the WDM Serial.sys. It
controls 16550-like UARTs on the PCI bus rather than legacy devices. Hence
the hardware resources are allocated differently and the driver acquires
the interrupt lock of the parent bus driver rather than calling
WdfInterruptSynchronize throughout the code.

The pool entries in question are the InterruptReadBuffer allocations made
whenever the COM port is opened or closed. This buffer is strictly a
receive buffer for UART data, and although the code does maintain pointers
to various offsets within the buffer, it generally isn’t re-allocated at
any point. I’ve been through the code that manipulates this buffer and
associated pointers - keeping things like pre-emption and race conditions
in mind - but can’t see anything.

What stumps me is the fact that, IIUC the large pool allocations don’t
store pool headers on the same page as the buffer memory, so a buffer
over/under-run shouldn’t generally corrupt those headers. In any case,
with special pool pages on it should bug-check under these conditions.

So I’m at a loss to explain how/where the pool structures are being
corrupted in this way. I’m wondering if there’s a mechanism that allows me
to walk the pool header lists for the whole system at any point in my code
to check for corruption there???

Regards,


OSR_Community_User · May 15, 2007, 9:11am

Mark-

With regard to forcing header checks, that isn’t directly accessible. You can try to force one by allocating and freeing several allocations at the point you want it done, but the walking is deferred for performance reasons, so it’s hard to be sure when it happens. There is no routine you can call directly.

As far as the rest goes-

Do you look for or see cases where an interrupt is serviced AFTER the port is closed? Would such a case cause an access to freed memory? Are there any race conditions around knowing whether the buffer exists or not? (You may have already thought about those, but I thought I’d ask just in case).

One other thing I’d try, especially since this seems to be narrowed down to this buffer already:
(1) “free” the buffer by filling it with a recognizable pattern (like the ever popular “0badf00d” or “deadbeef”), but don’t actually free it to the memory manager.
(2) periodically check your freed buffers for disturbance of the pattern (also, since I assume you read the data in a Dpc, check for there being a pattern in the Dpc itself before you even touch the buffer- if it’s there, you’ve found your bug).
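
Steps (1) and (2) might look roughly like the following user-mode sketch. The function names are hypothetical, the buffer size is assumed to be a multiple of four bytes (as with the 8K buffer here), and in a driver the check would run from whatever periodic or Dpc context is convenient.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POISON_WORD 0x0BADF00Du

/* "Frees" a buffer by filling it with the pattern; the memory stays allocated. */
void PoisonBuffer(void *buf, size_t bytes)
{
    const uint32_t w = POISON_WORD;
    unsigned char *p = buf;
    for (size_t i = 0; i + sizeof(w) <= bytes; i += sizeof(w))
        memcpy(p + i, &w, sizeof(w));
}

/* Returns the byte offset of the first disturbed word, or -1 if the pattern is intact. */
long FirstDisturbedOffset(const void *buf, size_t bytes)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i + sizeof(uint32_t) <= bytes; i += sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, p + i, sizeof(w));
        if (w != POISON_WORD) return (long)i;
    }
    return -1;
}
```

The offset of the first disturbed word gives a starting point for working out which code path wrote into the "freed" buffer, for example by comparing it with the read and write offsets the driver keeps for that buffer.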

The signature approach mentioned earlier is faster: put a signature in the buffer itself (or in the control structure [probably the device context?], or even both) that denotes the “allocated” or “freed” state, say in the first byte (losing one byte out of 8K isn’t much of a price). Just flip the state to “free” the buffer. If you’re using a signature in each, check for mismatches before proceeding, etc.

But that assumes the cause is touching the freed buffer directly, most likely in the Dpc that receives data. The first approach will also catch indirect accesses if they are happening, so you can try to zero in on them.

If doing that doesn’t catch it, then I think it’s worth taking another look at the original bugcheck.