Memory corruption + race condition bug hides when running Driver Verifier

Hello all,

I’ve got this stubborn bug that manifests itself with various different bugchecks: IRQL_NOT_LESS_OR_EQUAL, DRIVER_CORRUPTED_EXPOOL, BAD_POOL_HEADER, BAD_POOL_CALLER, et cetera. My first reaction was to run Driver Verifier with Special Pool and Pool Tracking, but I am not able to reproduce this bug at all with Driver Verifier running, regardless of the verification level, leaving me completely baffled on how to fix this. I am able to consistently replicate this when the verifier isn’t running on my driver.

Any ideas on how to tackle this?

  • dpj

I hesitate to suggest this in case it’s too obvious, but have you attempted to reproduce it while attached to Windbg, then run !analyze -v when the bug check happens?

From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of David Jackson
Sent: 05 October 2015 21:02
To: Kernel Debugging Interest List
Subject: [windbg] Memory corruption + race condition bug hides when running Driver Verifier

WINDBG is sponsored by OSR

OSR is hiring!! Info at http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

David Jackson wrote:

> I’ve got this stubborn bug that manifests itself with various different
> bugchecks: IRQL_NOT_LESS_OR_EQUAL, DRIVER_CORRUPTED_EXPOOL,
> BAD_POOL_HEADER, BAD_POOL_CALLER, et cetera. My first reaction was to
> run Driver Verifier with Special Pool and Pool Tracking, but I am not
> able to reproduce this bug at all with Driver Verifier running,
> regardless of the verification level, leaving me completely baffled on
> how to fix this. I am able to consistently replicate this when the
> verifier isn’t running on my driver.
>
> Any ideas on how to tackle this?

Well, not exactly. Verifier *is* the right choice to find the root cause
of such pool problems, and if it hides the problem instead, you’re out
of luck.

All I can offer is anecdotes about the kind of errors that have given me
such bugcheck codes in the past. It usually had to do with freeing MDLs
that I didn’t own, with some variation in the exact circumstances:

  • freeing an MDL that I didn’t own at all (received from the next upper
    driver)
  • freeing an MDL too soon, while the next lower driver was still working
    with it
  • returning an IRP too soon, whose MDL I had passed on to the next lower
    driver.
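The three cases above can be modelled in a few lines of user-mode C. This is only a sketch under stated assumptions: `model_mdl`, `new_mdl` and `release_mdl` are hypothetical stand-ins, not WDK types or routines. In a real driver the MDL would come from IoAllocateMdl and be released with IoFreeMdl, and "still in use" is determined by whether the lower driver has completed the IRP, not by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of an MDL: who allocated it, and whether a lower
 * driver is still working with it. Not the WDK MDL structure. */
typedef struct {
    bool allocated_by_us;  /* false: received from the next upper driver */
    int  lower_refs;       /* > 0: the next lower driver still uses it */
} model_mdl;

typedef enum { MDL_FREED, MDL_NOT_OURS, MDL_STILL_IN_USE } release_result;

/* Allocate a model MDL; returns NULL if out of memory. */
static model_mdl *new_mdl(bool allocated_by_us)
{
    model_mdl *m = calloc(1, sizeof *m);
    if (m != NULL)
        m->allocated_by_us = allocated_by_us;
    return m;
}

/* Free the MDL only when we own it and nobody below us is using it.
 * On success the caller's pointer is cleared so it cannot be reused. */
static release_result release_mdl(model_mdl **slot)
{
    assert(slot != NULL && *slot != NULL);
    model_mdl *m = *slot;

    if (!m->allocated_by_us)
        return MDL_NOT_OURS;      /* case 1: not ours to free */
    if (m->lower_refs > 0)
        return MDL_STILL_IN_USE;  /* cases 2 and 3: freed or completed too soon */

    free(m);
    *slot = NULL;
    return MDL_FREED;
}
```

Skipping either check in the model corresponds to the kind of mistake that corrupts pool at some unrelated later point, which is why the resulting bugcheck codes vary so much.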

> Any ideas on how to tackle this?

IIRC there is a gflags setting to make verifier look for buffer under-runs rather than over-runs (so the pool allocation starts immediately after a no-access page rather than ending immediately before one).
It might help.
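To make the difference between the two placements concrete, here is a user-mode analogy in C using POSIX mmap and mprotect. It is a sketch, not how Windows implements special pool: `guarded_alloc`, `guarded_alloc_init` and `guarded_alloc_free` are hypothetical names, and the request is assumed to fit in one page.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

typedef enum { DETECT_OVERRUN, DETECT_UNDERRUN } guard_mode;

typedef struct {
    unsigned char *base;  /* start of the two-page mapping */
    size_t         page;  /* page size in bytes */
    unsigned char *user;  /* pointer handed to the caller */
} guarded_alloc;

/* Map two pages, make one of them inaccessible, and place the buffer
 * flush against that guard page. Returns 0 on success, -1 on failure. */
static int guarded_alloc_init(guarded_alloc *g, size_t size, guard_mode mode)
{
    assert(g != NULL);
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0 || size == 0 || size > (size_t)ps)
        return -1;
    size_t page = (size_t)ps;

    unsigned char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Overrun mode: guard is the second page. Underrun mode: the first. */
    unsigned char *guard = (mode == DETECT_OVERRUN) ? base + page : base;
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(base, 2 * page);
        return -1;
    }

    g->base = base;
    g->page = page;
    /* Overrun: buffer ends where the guard begins.
     * Underrun: buffer begins where the guard ends. */
    g->user = (mode == DETECT_OVERRUN) ? base + page - size : base + page;
    return 0;
}

static void guarded_alloc_free(guarded_alloc *g)
{
    munmap(g->base, 2 * g->page);
}
```

In overrun mode a write one byte past the end of the buffer faults immediately; in underrun mode it is a write one byte before the start that faults. A bug that writes in the other direction lands in accessible memory and goes unnoticed, which is one way a corruption can hide under one setting and show up under the other.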

David Boyce wrote:

> I hesitate to suggest this in case it’s too obvious, but have you
> attempted to reproduce it while attached to Windbg, then run !analyze
> -v when the bug check happens?

Of course. The general theme I’m seeing is that there’s either a hang in, or an invalid memory access within, a specific KeAcquireSpinLockRaiseToDpc call inside of a lower level driver. I am somehow corrupting at least this particular KEVENT structure, but there’s no indication of when or how. The contents it’s being overwritten with are rather nondescript, so it isn’t that easy to take a guess at where the problem may lie.

Wilhelm Noeker wrote:

> Well, not exactly. Verifier *is* the right choice to find the root cause
> of such pool problems, and if it hides the problem instead, you’re out
> of luck.
>
> All I can offer is anecdotes about the kind of errors that have given me
> such bugcheck codes in the past. It usually had to do with freeing MDLs
> that I didn’t own, with some variation in the exact circumstances:
>
>   • freeing an MDL that I didn’t own at all (received from the next upper
>     driver)
>   • freeing an MDL too soon, while the next lower driver was still working
>     with it
>   • returning an IRP too soon, whose MDL I had passed on to the next lower
>     driver.

I’m actually kind of skeptical of the simplistic explanation that Driver Verifier makes the system slightly slower and therefore makes the problem unreproducible. I tried this with a partial checked build, which in theory should be much slower, but the bluescreens were still as readily reproducible as with free builds.

Do you think it would help if I changed the number of cores in my test VM?

I noticed on the MSDN page for Driver Verifier options it mentions “Kernel synchronization delay fuzzing” which sounds like it would be super useful for my needs. How come I can’t see it?

First thing when you google it:

“Kernel synchronization delay fuzzing”
*Note* This option is available starting with Windows 8.1.

This thread already has several answers, but I was personally expecting
you to paste at least one of the crash outputs (from !analyze -v) so that
we could understand the problem better.
You are giving too little, and too generic, information to work with. It
could be anything.

Good luck.

Bercea. G.

David Jackson wrote:

> The general theme I’m seeing is that there’s either a hang on or invalid memory access within a specific KeAcquireSpinLockRaiseToDpc call inside of a lower level driver. I am somehow corrupting at least this particular KEVENT structure, but there’s no indication of when or how. The contents it’s being overwritten with are rather nondescript, so it isn’t that easy to take a guess at where the problem may lie.

How does this match with your original description of the problem, where
you only mentioned the bugcheck codes (IRQL_NOT_LESS_OR_EQUAL,
DRIVER_CORRUPTED_EXPOOL, BAD_POOL_HEADER, BAD_POOL_CALLER), three of
which are indeed specific to the memory manager? I don’t see how you can
get, for example, BAD_POOL_CALLER by calling KeAcquireSpinLockRaiseToDpc.

Anyway, if you suspect that it is a particular piece of memory
that gets corrupted (the KEVENT that you are mentioning now), that could
be tracked down using a data breakpoint (“ba” command in the debugger),
for example “ba w4 <address>” to break whenever those four bytes are
written.