Problem with disappearing MDL

I’m trying to chase down an obscure bug in a hardware device driver.

Background: This is the only driver in the stack for the target
hardware. No IRPs are ever ‘sent down’. The driver is built from
Oney’s framework as published in the first edition of “Programming the
Microsoft Windows Driver Model”.

Scenario: Userland performs a query through IOCTL. This query is
placed in the write queue for the device, the IRP is locally cached,
marked pending and DispatchControl() returns an appropriate value. When
the periodic read service picks up the response to the query, the
initiating IRP is recovered from the internal cache, data is copied to
the userland buffer and the IRP is completed.

Symptom: On rare occasions, the MdlAddress pointer in the IRP
“disappears” (becomes reset to NULL) between the time DispatchControl is
called and the time the transaction handler attempts to place the IRP
into the internal cache. A userland pointer is unconditionally
recovered from the MdlAddress field as DispatchControl enters, using
MmGetSystemAddressForMdl. (I know it’s deprecated, but the driver has
to run on Win98 as well.) DispatchControl calls a transaction handler
function to place the query in the write queue and cache the IRP for
later completion. The handler calls a cache function to perform the add
to the internal cache. It is at this point (adding the pending
transaction to the internal cache) that the MdlAddress pointer is tested.

Questions: What could cause the MdlAddress pointer to be NULLed between
the initial MmGetSystemAddressForMdl call and the attempt to add to the
internal cache? Would moving the driver to the second edition version
of Oney’s framework address this?

On a related note: The userland query function has a timeout value and
the buffer provided for retrieving the query response is an automatic
variable. (legacy library code… I’m not supposed to change it without
a lot of justification.) If a query should take longer than
user_timeout to complete, the buffer represented by the MdlAddress field
of the cached IRP will no longer exist. KdPrint statements in the
OnCancel routine never show up in the debug console, so presumably the
IRP is not cancelled. What happens to MdlAddress? (I’m theorizing that
timeouts may be occurring under heavy device loads… the symptom only
seems to occur with one user’s userland program, which is not
instrumented very well for debugging)

Many thanks in advance for any clues you may have to offer.

Roy M. Silvernail - xxxxx@parker.com



xxxxx@parker.com wrote:

> Symptom: On rare occasions, the MdlAddress pointer in the IRP
> “disappears” (becomes reset to NULL) between the time DispatchControl is
> called and the time the transaction handler attempts to place the IRP
> into the internal cache.

The only way I can imagine this happening is that the IRP is somehow
getting completed. When you said that you marked this IRP pending and
returned an “appropriate value”, the only “appropriate value” would be
STATUS_PENDING.

Are you using the IOCTL caching scheme from the book? If so, can you
post relevant extracts from your code to show where you’re caching and
uncaching the IRP?

Potentially, switching to the 2d edition version of the wizard and
GENERIC (including service packs) will improve things.


Walter Oney, Consulting and Training
Basic and Advanced Driver Programming Seminars
Check out our schedule at http://www.oneysoft.com

Walter Oney wrote:

> xxxxx@parker.com wrote:
>
>> Symptom: On rare occasions, the MdlAddress pointer in the IRP
>> “disappears” (becomes reset to NULL) between the time DispatchControl is
>> called and the time the transaction handler attempts to place the IRP
>> into the internal cache.
>
> The only way I can imagine this happening is that the IRP is somehow
> getting completed. When you said that you marked this IRP pending and
> returned an “appropriate value”, the only “appropriate value” would be
> STATUS_PENDING.

OK, I was sounding a bit too formal. Yes, I’m returning STATUS_PENDING.
Also, I’m not pending the IRP until after it’s been successfully cached,
and only placing the transaction in the write queue after pending the
IRP. That shouldn’t provide any opportunity for early completion.

An interesting datapoint from last night’s test run. After trapping 2
missing MDLs, the driver still crashed with an access violation when
trying to mark an IRP pending. That means the IRP somehow lost its
stack location while being processed. Once again, no indication of a
cancellation.

> Are you using the IOCTL caching scheme from the book? If so, can you
> post relevant extracts from your code to show where you’re caching and
> uncaching the IRP?

Unfortunately, I’m not. The device has the potential to present replies
out of order, and some transactions are initiated by the driver itself,
so I need to identify the returning transactions and determine whether a
pending IRP is waiting for this reply. A FIFO cache won’t work, and I
need to also store the transaction signature for later lookup.
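
A minimal sketch of the kind of keyed lookup described above, as opposed to a FIFO. This is a userland model in portable C, not kernel code and not the driver's actual cache: `PendingEntry`, `pending_add`, `pending_take`, and the table size are hypothetical names chosen for illustration, and the `void *` stands in for a PIRP. In a real driver both functions would run with a spin lock held.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16  /* assumed table size, for illustration only */

/* One outstanding transaction. irp is NULL for driver-initiated ones. */
typedef struct {
    int      in_use;
    uint32_t signature;  /* hypothetical transaction signature */
    void    *irp;        /* stand-in for the PIRP waiting on this reply */
} PendingEntry;

static PendingEntry g_pending[MAX_PENDING];

/* Record a transaction. Returns 0 on success, -1 if the table is full. */
int pending_add(uint32_t signature, void *irp)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!g_pending[i].in_use) {
            g_pending[i].in_use = 1;
            g_pending[i].signature = signature;
            g_pending[i].irp = irp;
            return 0;
        }
    }
    return -1;
}

/* Find and remove the entry matching a reply, in any arrival order.
 * Returns 1 if the signature was pending (and stores its IRP, possibly
 * NULL, in *irp_out); returns 0 if nothing matches. Removing the entry
 * is what transfers ownership of the IRP to the caller. */
int pending_take(uint32_t signature, void **irp_out)
{
    assert(irp_out != NULL);
    *irp_out = NULL;
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (g_pending[i].in_use && g_pending[i].signature == signature) {
            *irp_out = g_pending[i].irp;
            g_pending[i].in_use = 0;
            g_pending[i].irp = NULL;
            return 1;
        }
    }
    return 0;
}
```

The point of the design is that a second `pending_take` for the same signature finds nothing, so only one path can ever end up holding a given IRP.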

> Potentially, switching to the 2d edition version of the wizard and
> GENERIC (including service packs) will improve things.

I think I’ll push for that. And I’ll extract some code and post it.

Thanks very much for replying.

Thus far, everything you have described points to unexpected completion.
MDL’s go away. Stack locations disappear. If you look at the first 4 bytes
of the IRP, I’ll bet you will find they look like a pointer and not the
normal sanity bytes of Type/Size.
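
A rough sketch of that Type/Size check, written as a userland model in portable C rather than real driver code. `IrpHeaderModel` and `irp_header_looks_sane` are hypothetical names, and the size bounds are whatever the caller considers plausible. The only value taken from the DDK headers is IO_TYPE_IRP, which is 6. In a driver you would simply look at Irp->Type and Irp->Size, or let !irp do it in the debugger.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IO_TYPE_IRP_MODEL 6  /* same value as IO_TYPE_IRP in the DDK headers */

/* Models only the first 4 bytes of an IRP: Type, then Size. */
typedef struct {
    int16_t  type;  /* IRP.Type, expected to be IO_TYPE_IRP */
    uint16_t size;  /* IRP.Size, total bytes including stack locations */
} IrpHeaderModel;

/* Returns 1 if the first 4 bytes look like a live IRP header, 0 if they
 * look like something else (for example a pointer left behind after the
 * IRP was completed and its memory reused). */
int irp_header_looks_sane(const void *irp, uint16_t min_size, uint16_t max_size)
{
    IrpHeaderModel h;

    assert(irp != NULL);
    memcpy(&h, irp, sizeof h);  /* copy out instead of casting the pointer */
    return h.type == IO_TYPE_IRP_MODEL &&
           h.size >= min_size && h.size <= max_size;
}
```

A check like this can only say "definitely stale"; a header that passes may still belong to an IRP that was completed and reallocated for someone else.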


Gary G. Little
Seagate Technologies, LLC


Yes, I agree that the IRP is most likely being completed. Why not look at
the IRP with WinDbg? !irp will do. !irpfind will list all IRPs, but it
takes a long time to complete. With !irp you can see whether the IRP has
been completed. Also, if it does not show up in !irpfind, or !irp says this
is not an IRP, then someone has indeed completed it.
I would look at the driver with !drvobj to get all device objects and
then use !devobj to look at each device stack.
You could also add some debug code that calls DbgBreakPoint() when you get
the NULL pointer, and then look at what you think is a non-completed IRP.

I may be barking up the wrong tree, but how about a memory write breakpoint
at the location of the IRP pointer? You might catch precisely who’s
resetting it to NULL, look at the processor stack and maybe you’ll catch the
culprit.

Alberto.

-----Original Message-----
From: Roy M. Silvernail [mailto:xxxxx@parker.com]
Sent: Wednesday, September 03, 2003 9:01 AM
To: Windows System Software Developers Interest List
Subject: [ntdev] Re: Problem with disappearing MDL


Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256

You are currently subscribed to ntdev as: xxxxx@compuware.com
To unsubscribe send a blank email to xxxxx@lists.osr.com


“Roy M. Silvernail” wrote in message
news:xxxxx@ntdev…
>
> Walter Oney wrote:
> > xxxxx@parker.com wrote:
> >
> >>Symptom: On rare occasions, the MdlAddress pointer in the IRP
> >>“disappears” (becomes reset to NULL) between the time DispatchControl is
> >>called and the time the transaction handler attempts to place the IRP
> >>into the internal cache.
> >

I hate to ask the incredibly obvious question: Did you run this driver under
Verifier?? CUV??

That should find problems like this in no time…

Peter
OSR

“Roy M. Silvernail” wrote:

> OK, I was sounding a bit too formal. Yes, I’m returning STATUS_PENDING.
> Also, I’m not pending the IRP until after it’s been successfully cached,
> and only placing the transaction in the write queue after pending the
> IRP. That shouldn’t provide any opportunity for early completion.

Yes, it does. As soon as you put a pointer to an IRP someplace where
another part of your driver can pick it up on an asynchronous path, you
create the possibility that the pointer is immediately stale. The only
safe way to handle this situation is to unconditionally mark the IRP
pending, cache it, and then return STATUS_PENDING. From then on, the
only people allowed to touch the IRP are the cancel routine, the
IRP_MJ_CLEANUP handler, and the guy who uncaches it to complete it. They
have to interlock with each other. This is pretty hairy to get right,
which is why I went to so much trouble with the IOCTL caching code
that’s in my book.
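
A minimal model of the ordering described above, in portable userland C. It is not the book's IOCTL caching code and not real kernel code: `MockIrp`, `dispatch_control`, `uncache`, and `complete_reply` are hypothetical names, the cache is a single slot, and the places where a real driver would take a spin lock or call IoMarkIrpPending / IoCompleteRequest are marked only in comments. The one value taken from the real headers is STATUS_PENDING (0x00000103).

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_STATUS_PENDING 0x103  /* STATUS_PENDING in ntstatus.h */

typedef struct {
    int pending_marked;  /* models the effect of IoMarkIrpPending */
    int completed;       /* models the effect of IoCompleteRequest */
} MockIrp;

static MockIrp *g_cached;  /* single-slot cache, enough to show ordering */

/* Dispatch path: mark pending first, publish second, then let go. */
int dispatch_control(MockIrp *irp)
{
    irp->pending_marked = 1;  /* 1. unconditionally mark the IRP pending */

    /* 2. publish it; a real driver holds the cache lock here */
    g_cached = irp;

    /* 3. only now would the transaction go to the write queue; a reply
     *    can arrive, and the IRP can be completed, at any moment after */

    /* 4. return without touching irp again: it may already be gone */
    return MODEL_STATUS_PENDING;
}

/* Whoever removes the pointer from the cache owns the IRP. The reply
 * path, the cancel routine, and the cleanup handler would all go
 * through this, under the same lock. */
MockIrp *uncache(void)
{
    MockIrp *irp = g_cached;
    g_cached = NULL;
    return irp;
}

/* Reply path: complete the IRP only if we won the race to uncache it. */
int complete_reply(void)
{
    MockIrp *irp = uncache();
    if (irp == NULL)
        return 0;  /* someone else already took it; do nothing */
    assert(irp->pending_marked && !irp->completed);
    irp->completed = 1;
    return 1;
}
```

The model is single-threaded, so it shows only the order of operations and the single-owner rule, not the interlocking itself.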


Walter Oney, Consulting and Training
Basic and Advanced Driver Programming Seminars
Check out our schedule at http://www.oneysoft.com

> Thus far, everything you have described points to unexpected completion.

I agree.

> OK, I was sounding a bit too formal. Yes, I’m returning STATUS_PENDING.
> Also, I’m not pending the IRP until after it’s been successfully cached,
> and only placing the transaction in the write queue after pending the
> IRP. That shouldn’t provide any opportunity for early completion.

I don’t know the code, but it sounds to me like there is a window here.
What if the transaction completes before being placed in the write queue?
It sounds like you would have it in the cache, but not the write queue?
Would this cause problems?

Loren

Loren Wilton wrote:

>> Thus far, everything you have described points to unexpected completion.
>
> I agree.

And after roughly 18 hours of test run with no anomalies trapped, I also
agree. Unexpected completion seems to have been the culprit, and
reworking the locking strategy between caching and uncaching appears to
have addressed it.

> I don’t know the code, but it sounds to me like there is a window here.
> What if the transaction completes before being placed in the write queue?
> It sounds like you would have it in the cache, but not the write queue?
> Would this cause problems?

Yes, that would be a problem, and I have to say I don’t see how this is
possible. But that doesn’t make it impossible… just means I don’t
know what to look for. If a transaction never enters the write queue,
the device will never fulfill it and the read service /shouldn’t/
receive a reply. The chronology of this fault sure does look inverted,
but Mr. Oney warned about that in his book.

In any case, I think I’m at least on the right track. Now off to study
Walter’s IOCTL caching in detail to make sure my solution is properly
bullet-resistant.

Thanks to all who provided clues!

Why not use METHOD_BUFFERED for this IOCTL?

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

----- Original Message -----
From:
To: “Windows System Software Developers Interest List”
Sent: Tuesday, September 02, 2003 7:32 PM
Subject: [ntdev] Problem with disappearing MDL


Maxim S. Shatskih wrote:

> Why not use METHOD_BUFFERED for this IOCTL?

Fair question. The IOCTL in question is the primary handler for all
structured transactions between userland and our device. The original
developer decided not to use METHOD_BUFFERED to avoid having the system
perform constant data copying. This IOCTL is called a *lot*.
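
For context on the buffered-versus-direct choice: the transfer method lives in the low two bits of the control code, and it decides whether the driver sees a copied system buffer or an MDL. The sketch below mirrors the CTL_CODE bit layout and the METHOD_* values from the DDK headers in portable C. The `MODEL_` names and `IOCTL_MODEL_QUERY` are made up for illustration and are not this driver's real control code. The choice of METHOD_OUT_DIRECT is an assumption, based only on the driver reading MdlAddress.

```c
#include <assert.h>
#include <stdint.h>

/* Same numeric values as METHOD_* in the DDK headers. */
#define MODEL_METHOD_BUFFERED   0
#define MODEL_METHOD_IN_DIRECT  1
#define MODEL_METHOD_OUT_DIRECT 2
#define MODEL_METHOD_NEITHER    3

/* Same bit layout as the CTL_CODE macro. */
#define MODEL_CTL_CODE(dev, func, method, access)            \
    (((uint32_t)(dev) << 16) | ((uint32_t)(access) << 14) |  \
     ((uint32_t)(func) << 2) | (uint32_t)(method))

/* Hypothetical control code: vendor device type 0x8000, function 0x800,
 * direct output, FILE_ANY_ACCESS (0). */
#define IOCTL_MODEL_QUERY \
    MODEL_CTL_CODE(0x8000, 0x800, MODEL_METHOD_OUT_DIRECT, 0)

/* Extract the transfer method from a control code. */
uint32_t ioctl_method(uint32_t code)
{
    uint32_t m = code & 3u;
    assert(m <= MODEL_METHOD_NEITHER);  /* the method is exactly two bits */
    return m;
}

/* Direct methods: the I/O manager locks the caller's buffer and describes
 * it with an MDL (Irp->MdlAddress), so no copy is made.
 * METHOD_BUFFERED: data is copied through Irp->AssociatedIrp.SystemBuffer
 * and there is no MDL for the driver to look at. */
int ioctl_uses_mdl(uint32_t code)
{
    uint32_t m = ioctl_method(code);
    return m == MODEL_METHOD_IN_DIRECT || m == MODEL_METHOD_OUT_DIRECT;
}
```

Changing the method changes the control code itself, so the driver and the userland library would have to be rebuilt together.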

FYI, the transaction guarding logic turned out to be faulty. Once I
corrected that, I got >48 hours without a problem. (would have been
more, but the office took a power hit over the weekend).