Heap Corruption

All,

this is possibly off topic for this forum, but honestly, I couldn’t
find a better forum than ntdev for technical questions!

Platform:

win2k3 and winxp 32 bit

Scenario:

We have a heap corruption caused by a service of ours. This service
collects logs from our driver. The setup is very similar to filespy,
and all file ops are logged by the service.

At times when the service gets logs from the driver, it crashes.

Root cause:

We have root caused the issue already to be that one of the objects
inside the service is getting destroyed, however, the thread which
polls for the log is still alive and tries accessing the object.

We debugged this by enabling pageheap on the service and it is simple
to reproduce. The trouble is, the exact same code runs on win2k3 x64
and winxp x64 and all other 32 and 64 bit windows like win 7 and win
2k8 without a crash. No amount of debugging or scenario reproduction
has ever made this crash happen there (ever being a long time really).

Now, we need to find out why it won't crash on other OSs or bit depths,
with pageheap or otherwise; it is a simple enough crash. I read about
FTH in win 7 and above, but FTH will suppress the error and log an
event, and so far the event logs have also come out clean. So no
suppression at all.

So what is so different in the memory manager of win2k3-x64 or other,
later 32 and 64 bit OSs that we are unable to reproduce this? Can
someone please help me with a possible root cause here?

Thanks

Ami

You know you destroy an object that is still accessed by your log polling thread. This is a *very* good starting point.

Why don’t you do a code review, deduce why and under what conditions the object could possibly be destroyed, and fix those conditions? Add debug output to the destructor. If the object is destroyed while the thread is alive, break into the debugger, etc.
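
The destructor check suggested above could look roughly like this. It is only a sketch in portable C++, not the service's actual code: `LogSink`, `enter()`, `leave()` and the counter are hypothetical names, and a real Windows service would more likely call DebugBreak() or write to the debugger where the comment indicates.

```cpp
#include <atomic>
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for the service object that the polling thread uses.
class LogSink {
public:
    // The polling thread brackets every access to the object with these.
    void enter() { users_.fetch_add(1, std::memory_order_acq_rel); }
    void leave() {
        int prev = users_.fetch_sub(1, std::memory_order_acq_rel);
        assert(prev > 0);  // leave() without a matching enter() is a bug
        (void)prev;
    }

    ~LogSink() {
        int n = users_.load(std::memory_order_acquire);
        if (n != 0) {
            // This is where a real service would break into the debugger.
            std::fprintf(stderr, "LogSink destroyed with %d active user(s)\n", n);
            premature_destructions().fetch_add(1);
        }
    }

    // Process-wide count of destructions that happened while in use.
    static std::atomic<int>& premature_destructions() {
        static std::atomic<int> count{0};
        return count;
    }

private:
    std::atomic<int> users_{0};
};

// Destroys one sink cleanly and one while it is still "in use";
// returns how many premature destructions were detected.
int demo_destructor_check() {
    int before = LogSink::premature_destructions().load();
    {
        LogSink ok;
        ok.enter();
        ok.leave();
    }  // clean destruction, nothing reported
    {
        LogSink bad;
        bad.enter();  // simulates the poller still being inside the object
    }  // destructor reports the problem
    return LogSink::premature_destructions().load() - before;
}
```

The check only detects the bad destruction; it does not prevent it. Its value is that it fires at the moment of destruction, on every platform, instead of waiting for the allocator to expose the stale access.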

Heap corruption in an app, particularly of the kind you describe,
generally has a deeper root cause than you have come up with. The real
root cause is erroneous design and/or coding; the fact that an object can
be deleted in one thread while being live in another thread indicates a
fundamental design/implementation error.

The only way to do this correctly is to have the notion of “ownership”
rigorously defined. Then it works like this: at any given point in time,
an object is “owned” by precisely one thread. A thread may transfer its
ownership by passing a pointer to the object to another thread, at which
point the previous owner forgets that the object ever existed. If it puts
the pointer into a queue, for example, then it must never, ever reference
that object again. Unless, of course, ownership is handed back to it, at
which point it sees, in effect, a new object, and has no history of ever
having had it before.
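
In C++ the hand-off rule described above can be expressed with std::unique_ptr, which leaves the sender holding a null pointer after the transfer. The following is a single-threaded sketch of the hand-off only; `LogRecord`, `enqueue` and `dequeue` are hypothetical names, and a queue used across threads would also need a lock.

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <string>
#include <utility>

struct LogRecord {  // hypothetical payload
    std::string text;
};

using RecordQueue = std::queue<std::unique_ptr<LogRecord>>;

// Taking the pointer by value forces the caller to give up ownership.
void enqueue(RecordQueue& q, std::unique_ptr<LogRecord> rec) {
    q.push(std::move(rec));
}

// The caller of dequeue() becomes the one and only owner.
std::unique_ptr<LogRecord> dequeue(RecordQueue& q) {
    assert(!q.empty());
    std::unique_ptr<LogRecord> rec = std::move(q.front());
    q.pop();
    return rec;
}

// Returns true if the sender's pointer is empty after the hand-off
// and the receiver sees the record intact.
bool demo_transfer() {
    RecordQueue q;
    auto rec = std::make_unique<LogRecord>();  // requires C++14
    rec->text = "IRP_MJ_CREATE";

    enqueue(q, std::move(rec));
    bool sender_forgot = (rec == nullptr);  // a moved-from unique_ptr is null

    std::unique_ptr<LogRecord> mine = dequeue(q);
    return sender_forgot && mine->text == "IRP_MJ_CREATE";
}
```

The point of the design is that "must never reference that object again" stops being a convention: the sender no longer has anything to reference.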

Problems like this are timing-sensitive and also sensitive to various
internal details of the allocator; the fact that it does not fail on other
platforms can be attributed to bad luck, in that the bug is lurking in
there and /could/ manifest at any time. The fact that it does not show up
does not prove it is not there.

One possible approach is to compile the 64-bit version using the debug
configuration. The debug allocator overwrites freed objects with a nonsense
pattern (on 32-bit systems you typically see 0xFEEEFEEE, written by the
Windows debug heap; the debug CRT itself fills freed blocks with 0xDD),
which means that a “stale” pointer is very likely to see garbage. In a
normal allocator, most or all of the data remains intact, so you don’t
notice the pointer is stale.
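
To illustrate why a fill pattern makes a stale pointer visible, here is a toy model. It is not how any real allocator is implemented: `ToyHeap` is a hypothetical one-slot "heap" whose storage stays valid after release, so the late read is well defined here, whereas with a real heap it would be undefined behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// A toy one-slot heap. The slot outlives every allocation, so reading it
// after release() is legal in this model.
class ToyHeap {
public:
    explicit ToyHeap(bool poison_on_free) : poison_(poison_on_free) {}

    std::uint32_t* allocate(std::uint32_t value) {
        assert(!in_use_);
        in_use_ = true;
        slot_ = value;
        return &slot_;
    }

    void release(std::uint32_t* p) {
        assert(p == &slot_ && in_use_);
        (void)p;
        in_use_ = false;
        if (poison_) {
            // Debug-style behavior: overwrite the freed bytes.
            std::memset(&slot_, 0xDD, sizeof slot_);
        }
        // Release-style behavior: the old bytes are simply left in place.
    }

private:
    bool poison_;
    bool in_use_ = false;
    std::uint32_t slot_ = 0;
};

// Returns what a "stale" pointer observes after the block was released.
std::uint32_t stale_read(bool poison_on_free) {
    ToyHeap heap(poison_on_free);
    std::uint32_t* p = heap.allocate(42);
    heap.release(p);
    return *p;  // defined only because ToyHeap still owns the storage
}
```

Without the fill, the stale read returns the old value and the program appears to work; with the fill, it returns 0xDDDDDDDD and the bug shows up immediately.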

The real solution is to never write any form of code in which two threads
can think they own a single object. Note this is different from sharing
an object; for example, a queue is created and is shared between two
threads. Although each thread is free to access the queue, /neither/
thread “owns” it. Therefore, neither thread has the right to delete it.
Think of a queue in the device extension. It is created by the AddDevice
thread, and deleted by the remove-device thread, and these are guaranteed
to not be concurrent. It is shared by the passive-level interface and the
DPC level (in simplified drivers). The IRPs, on the other hand, are handed
off from one thread by enqueue and ownership is obtained by a dequeue. In
principle, only one thread at a time owns the IRP. If you read the
archives of this NG, you will see lots of people having problems because
they violate the single-thread-ownership principle.

Of course, access to shared objects must involve some kind of mutual
exclusion mechanism, such as a spin lock (CRITICAL_SECTION in user space,
sort of the equivalent) or mutex (or FAST_MUTEX or ERESOURCE in the
kernel).
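
A user-mode sketch of a shared but unowned queue, using standard C++ in place of CRITICAL_SECTION: std::mutex provides the mutual exclusion, and the creating thread, not the producer or the consumer, is the one that destroys the queue. `SharedQueue` and `demo_shared_queue` are hypothetical names chosen for the illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Shared by the producer and the consumer; created and destroyed by neither.
class SharedQueue {
public:
    void push(int v) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(v);
        }
        cv_.notify_one();
    }

    int pop() {  // blocks until an item is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        int v = q_.front();
        q_.pop();
        return v;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
};

// The calling thread creates the queue, runs both users, waits for them,
// and only then lets the queue be destroyed.
long demo_shared_queue(int count) {
    assert(count >= 0);
    SharedQueue q;
    long sum = 0;

    std::thread producer([&q, count] {
        for (int i = 1; i <= count; ++i) q.push(i);
    });
    std::thread consumer([&q, &sum, count] {
        for (int i = 0; i < count; ++i) sum += q.pop();
    });

    producer.join();
    consumer.join();
    return sum;  // q is destroyed here, after both users are gone
}
```

This mirrors the AddDevice/remove-device arrangement: creation and destruction happen on a thread that is guaranteed not to run concurrently with the users.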

But from your description, my evaluation is that the design is probably
the /real/ root cause, and what you referred to as the “root cause” is
merely a manifestation of bad design. Alternatively, the design was sound
but the implementation did not follow the design, and consequently you are
left with the problem of multithread ownership. Note that an owned object
implies the owner has the right to delete the object with no concern for
consequences, and no non-owner thread is permitted to access the object
for any purpose whatsoever. This is what you really have to look for.

I’m sure someone will say “but you could use reference counts”. This will
work as long as everyone handles the reference counts correctly. This
means that not only can any thread “free” the object (reduce its reference
count by 1 and only release the storage back to the heap when the
reference count goes to 0), but every thread /must/ free the object when
it is finished with it. You may try this, but don’t be surprised at how
difficult it becomes to make sure the reference count is properly
maintained (and I don’t mean being sure you use InterlockedIncrement and
not ++). I have found in decades of multithreaded programming that the
single-owner model is the easiest to specify, to reason about, and to get
right the first time.
joe
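
For comparison, a hand-rolled reference count in the style discussed above might look like the sketch below. std::atomic stands in for InterlockedIncrement/InterlockedDecrement, and `RefCounted` and `demo_refcount` are hypothetical names. The atomic arithmetic is the easy part; the part that tends to go wrong is taking each reference at the right moment and releasing it exactly once.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class RefCounted {
public:
    // 'destroyed' lets the demo observe how many times the object was freed.
    explicit RefCounted(std::atomic<int>* destroyed) : destroyed_(destroyed) {}

    void add_ref() { refs_.fetch_add(1, std::memory_order_relaxed); }

    // Whoever drops the count to zero frees the storage.
    void release() {
        int prev = refs_.fetch_sub(1, std::memory_order_acq_rel);
        assert(prev > 0);  // more releases than references is a bug
        if (prev == 1) delete this;
    }

private:
    ~RefCounted() { destroyed_->fetch_add(1); }  // only release() may delete

    std::atomic<int> refs_{1};  // the creator holds the first reference
    std::atomic<int>* destroyed_;
};

// Returns how many times the object was destroyed (should be exactly 1).
int demo_refcount(int workers) {
    std::atomic<int> destroyed{0};
    RefCounted* obj = new RefCounted(&destroyed);

    std::vector<std::thread> threads;
    for (int i = 0; i < workers; ++i) {
        obj->add_ref();  // taken on the worker's behalf BEFORE it starts
        threads.emplace_back([obj] {
            // ... the worker would use obj here ...
            obj->release();  // every user must release when finished
        });
    }

    obj->release();  // the creator is finished; obj must not be touched again
    for (auto& t : threads) t.join();
    return destroyed.load();
}
```

Note that the reference for each worker is taken before the thread is started. Taking it inside the worker would reintroduce the race, because the creator could release the last reference first.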


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

Also note that multiple threads can share read access to an object
simultaneously. This pattern uses reference counts, and the thread that
happens to release the last reference is the one that destroys the object.
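
In standard C++ this shared-read pattern is what std::shared_ptr provides. A minimal sketch, assuming a hypothetical read-only `Config` object (the names are illustrative, not from the service in question):

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <string>
#include <thread>
#include <vector>

struct Config {  // hypothetical read-only shared object
    std::string name;
    std::atomic<int>* destroyed;  // lets the demo count destructions
    ~Config() { destroyed->fetch_add(1); }
};

// Returns how many readers saw the expected data.
int demo_shared_readers(int readers) {
    std::atomic<int> destroyed{0};
    std::atomic<int> matches{0};
    std::vector<std::thread> threads;
    {
        std::shared_ptr<const Config> cfg(new Config{"filter", &destroyed});
        for (int i = 0; i < readers; ++i) {
            // Capturing cfg by value copies the pointer: one more reference.
            threads.emplace_back([cfg, &matches]() mutable {
                if (cfg->name == "filter") matches.fetch_add(1);
                cfg.reset();  // this reader is finished with the object
            });
        }
    }  // the creator drops its reference; readers may still hold theirs

    for (auto& t : threads) t.join();
    assert(destroyed.load() == 1);  // destroyed exactly once, by the last holder
    return matches.load();
}
```

Which thread runs the destructor is not predictable; it is whichever one happens to drop the last reference, so the destructor must be safe to run on any of them.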

Especially with this type of race bug there are very many variables that can
influence the way the bug manifests. I would first fix my logic problem and
then try to explain why it didn’t crash. If you know that your software
crashes you know for a fact that it is broken. While the software is still
running you know nothing, except that it can crash at any time in the
future.

//Daniel
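
As an illustration of what fixing the logic problem could mean for the scenario in this thread, the object can own its polling thread and join it before its own storage goes away. This is a sketch under assumptions, not the poster's code: `LogCollector` and its members are hypothetical, and the sleep stands in for fetching logs from the driver.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical service object that owns its polling thread.
class LogCollector {
public:
    LogCollector() : poller_([this] { poll_loop(); }) {}

    ~LogCollector() {
        assert(poller_.joinable());
        stop_.store(true);
        poller_.join();  // after this line no other thread touches *this
    }

    int polls() const { return polls_.load(); }

private:
    void poll_loop() {
        while (!stop_.load()) {
            polls_.fetch_add(1);  // stands in for "get logs from the driver"
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    std::atomic<bool> stop_{false};
    std::atomic<int> polls_{0};
    std::thread poller_;  // declared last so it starts after the fields above
};

// Creates a collector, waits until it has polled at least once, and then
// destroys it; returns true if polling was observed.
bool demo_clean_shutdown() {
    LogCollector c;
    while (c.polls() == 0) std::this_thread::yield();
    return c.polls() > 0;
}  // ~LogCollector stops and joins the poller before the members are freed
```

With this arrangement there is no window in which the thread is alive and the object is gone, so the outcome no longer depends on allocator behavior or OS version.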

“Ami Awbadhho” wrote in message news:xxxxx@ntdev…
Now, we need to find out why it wont crash for other OSs