IoCancelIrp jumping off into the weeds

Step 1: Yes. I encountered a potential deadlock (TODO: FIXME) in an unrelated area of code, but nothing in the PCAP code.

Step 2: No, haven’t tried that yet. Good idea.

I’d fix that and retry. Or, be lazier and just disable deadlock detection in Driver Verifier and retry.

IF this is a “double free” of an IRP… Driver Verifier is likely to be your fastest and bestest way to run this problem down. Be sure to turn on all the options that might help, and disable the ones (like the low resource simulations, and I guess deadlock detection in your case) that don’t matter.

Peter
OSR
@OSRDrivers
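A quick aside on setting that up: Driver Verifier is configured from an elevated command prompt and takes effect after a reboot. A sketch, using this thread's driver name as a placeholder (the actual .sys filename may differ) and the flag bits listed by `verifier /?`:

```shell
# Enable the checks that matter here and skip the ones that don't
# (0x4 low resources simulation, 0x20 deadlock detection), per the
# advice above. Bit values per "verifier /?":
#   0x1  special pool
#   0x2  force IRQL checking
#   0x8  pool tracking
#   0x10 I/O verification
verifier /flags 0x1B /driver ebUniversalProForEthernet.sys

verifier /query    # confirm what is currently enabled
verifier /reset    # turn everything off again (reboot to take effect)
```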

A mildly interesting thing that I just confirmed is that the behavior has evolved with kernel versions.

I had reverted to an older Win10 snapshot and the symptoms were different than they are with the Fall Creators Update.

The main thread does a WaitForSingleObject on the child thread before finishing up and writing out the buffers. Pre-Creators Update, this would just hang: the child thread would never seem to finish and signal the waiting thread. The child didn’t seem to actually exist though, at least as far as WinDbg is concerned. Like a zombie thread or something.

With the newer builds it displays the crashing behavior as described above. Anyway, off to figure out how to get the checked build going.

That’s a good idea too. Verifier is certainly easier to start with than checked build. The deadlock issue is unfortunately fairly complicated and seems like it will take a fair bit of rework to get to the bottom of.

I’ve inherited a pretty big pile of very old code here and we’re trying to ‘take it back’, establish better practices, etc. I’ve been adding SAL, running static analysis, setting up HLK/HCK tests, Verifier, etc. It’s a big hill to climb.

Is your code using the MSFT cancel safe queue mechanism or is it so old
that it rolled its own cancel processing?

Mark Roddy

On Mon, Apr 23, 2018 at 11:48 AM, xxxxx@pleora.com <
xxxxx@lists.osr.com> wrote:

That’s a good idea too. Verifier is certainly easier to start with than
checked build. The deadlock issue is unfortunately fairly complicated and
seems like it will take a fair bit of rework to get to the bottom of.

I’ve inherited a pretty big pile of very old code here and we’re trying to
‘take it back’, establish better practices, etc. I’ve been adding SAL,
running static analysis, setting up HLK/HCK tests, Verifier, etc. It’s a
big hill to climb.


NTDEV is sponsored by OSR

Visit the list online at: http://www.osronline.com/showlists.cfm?list=ntdev

MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

> It’s a big hill to climb.

I sympathize.

Free hint: Check the Cleanup vs Cancel logic. Over the past few years, I’ve seen a whole raft of drivers that confuse Cleanup and Cancel… and in fact use Cleanup INSTEAD of Cancel.

https://www.osr.com/nt-insider/2017-issue2/handling-cleanup-close-cancel-wdf-driver/

Peter
OSR
@OSRDrivers

In addition to the other comments…

If you have Verifier enabled you can try finding the IRP in the IRP
alloc/complete/cancel log:

!verifier 100 address

Also, can you do a “dt nt!_IRP address” and show the IRP? Lastly, what does
“!irp address 1” show? My goal with these is to categorize the state of the
IRP a little better (e.g. is the whole thing garbage or does it look
“normal”)

-scott
OSR
@OSRDrivers

> Is your code using the MSFT cancel safe queue mechanism or is it so old
> that it rolled its own cancel processing?

Definitely doing its own cancel processing. One strange thing that I don’t understand yet is why the cancel handler never gets set for the IRPs of this particular filter.

The driver is architected like three filters in one: a receiver, a transmitter and a pcap recorder. They seem to follow slightly different IRP-handling paths though, so maybe I need to try to reconcile what they’re doing.

Thanks. I’ll check that article out.

Free hint: Check the Cleanup vs Cancel logic. Over the past few years, I’ve
seen a whole raft of drivers that confuse Cleanup and Cancel… and in fact use
Cleanup INSTEAD of Cancel.

How do I figure out what the IRP address is?

If you have Verifier enabled you can try finding the IRP in the IRP
alloc/complete/cancel log:

!verifier 100 address

Also, can you do a “dt nt!_IRP address” and show the IRP? Lastly, what does
“!irp address 1” show? My goal with these is to categorize the state of the
IRP a little better (e.g. is the whole thing garbage or does it look
“normal”)

NM, I think I found it. It was at the top of the list (unsurprising I guess).
2: kd> !verifier 0x100 fffffd80d5412ea0


The log entries are displayed in reverse chronological order.

IRP fffffd80d5412ea0, Thread ffffa38df4b0c700, IRQL = 1, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833afdb2 nt!IovCancelIrp+0xe
fffff80082dcab13 nt!IoCancelIrp+0xe5553
fffff80083151043 nt!IoCancelThreadIo+0x5b
fffff8008315090a nt!PspExitThread+0x266
fffff800831512a6 nt!PspTerminateThreadByPointer+0x96
fffff8008315166e nt!NtTerminateThread+0x4a
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13

IRP fffffd80d5412ea0, Thread ffffa38df4b0c700, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833af650 nt!IovAllocateIrp+0x30
fffff80082d9e567 nt!IopAllocateIrpExReturn+0x151c67
fffff800830e0b1c nt!IopXxxControlFile+0x3ec
fffff800830e0716 nt!NtDeviceIoControlFile+0x56
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13

IRP fffffd80d5412ea0, Thread ffffa38df4334700, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833afe16 nt!IovCompleteRequest+0x2e
fffff80082daaa63 nt!IofCompleteRequest+0x1304f3
fffff80d26f1866d VerifierExt!IofCompleteRequest_wrapper+0xbd
fffff800833b4b30 nt!VerifierIofCompleteRequest+0x10
fffff80d28d56e1b ebUniversalProForEthernet!PIG_UnmapRequest+0x9b
fffff80d28d56338 ebUniversalProForEthernet!PIG_IoControlWaitRequests+0x198
fffff80d28d55728 ebUniversalProForEthernet!PIG_IoControl+0x98
fffff80d28d420c8 ebUniversalProForEthernet!DIS_PlugInIoControl+0x28
fffff80d28d5972e ebUniversalProForEthernet!WFT_DispatchIoControl+0x1ee
fffff80d27cc368b ndis!ndisDummyIrpHandler+0x8b
fffff800833cd548 nt!ViGenericDispatchHandler+0x54
fffff800833cd4b6 nt!ViGenericDeviceControl+0x16
fffff80082d2dcfa nt!IopfCallDriver+0x56

IRP fffffd80d5412ea0, Thread ffffa38df4334700, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833af650 nt!IovAllocateIrp+0x30
fffff80082d9e567 nt!IopAllocateIrpExReturn+0x151c67
fffff800830e0b1c nt!IopXxxControlFile+0x3ec
fffff800830e0716 nt!NtDeviceIoControlFile+0x56
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13
Parsing entry 0xd88/0x4000
IRP fffffd80d5412ea0, Thread ffffa38df4334700, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833afe16 nt!IovCompleteRequest+0x2e
fffff80082daaa63 nt!IofCompleteRequest+0x1304f3
fffff80d26f1866d VerifierExt!IofCompleteRequest_wrapper+0xbd
fffff800833b4b30 nt!VerifierIofCompleteRequest+0x10
fffff80d28d597c7 ebUniversalProForEthernet!WFT_DispatchIoControl+0x287
fffff80d27cc368b ndis!ndisDummyIrpHandler+0x8b
fffff800833cd548 nt!ViGenericDispatchHandler+0x54
fffff800833cd4b6 nt!ViGenericDeviceControl+0x16
fffff80082d2dcfa nt!IopfCallDriver+0x56
fffff800833afd5d nt!IovCallDriver+0x275
fffff80082d9e54b nt!IofCallDriver+0x151ccb
fffff800830e155e nt!IopSynchronousServiceTail+0x19e
fffff800830e0d9c nt!IopXxxControlFile+0x66c

IRP fffffd80d5412ea0, Thread ffffa38df4334700, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833af650 nt!IovAllocateIrp+0x30
fffff80082d9e567 nt!IopAllocateIrpExReturn+0x151c67
fffff800830e0b1c nt!IopXxxControlFile+0x3ec
fffff800830e0716 nt!NtDeviceIoControlFile+0x56
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13

IRP fffffd80d5412ea0, Thread ffffa38df1b10080, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833afe16 nt!IovCompleteRequest+0x2e
fffff80082daaa63 nt!IofCompleteRequest+0x1304f3
fffff80d28f225ee nsiproxy!NsippDispatch+0xae
fffff80082d2dcfa nt!IopfCallDriver+0x56
fffff800833afd5d nt!IovCallDriver+0x275
fffff80082d9e54b nt!IofCallDriver+0x151ccb
fffff800830e155e nt!IopSynchronousServiceTail+0x19e
fffff800830e0d9c nt!IopXxxControlFile+0x66c
fffff800830e0716 nt!NtDeviceIoControlFile+0x56
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13

IRP fffffd80d5412ea0, Thread ffffa38df1b10080, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833af650 nt!IovAllocateIrp+0x30
fffff80082d9e567 nt!IopAllocateIrpExReturn+0x151c67
fffff800830e0b1c nt!IopXxxControlFile+0x3ec
fffff800830e0716 nt!NtDeviceIoControlFile+0x56
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13

IRP fffffd80d5412ea0, Thread ffffa38df4205080, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833afe16 nt!IovCompleteRequest+0x2e
fffff80082daaa63 nt!IofCompleteRequest+0x1304f3
fffff80d26f1866d VerifierExt!IofCompleteRequest_wrapper+0xbd
fffff800833b4b30 nt!VerifierIofCompleteRequest+0x10
fffff80d28d59ba6 ebUniversalProForEthernet!WFT_DispatchIoControl+0x666
fffff80d27cc368b ndis!ndisDummyIrpHandler+0x8b
fffff800833cd548 nt!ViGenericDispatchHandler+0x54
fffff800833cd4b6 nt!ViGenericDeviceControl+0x16
fffff80082d2dcfa nt!IopfCallDriver+0x56
fffff800833afd5d nt!IovCallDriver+0x275
fffff80082d9e54b nt!IofCallDriver+0x151ccb
fffff800830e155e nt!IopSynchronousServiceTail+0x19e
fffff800830e0d9c nt!IopXxxControlFile+0x66c

IRP fffffd80d5412ea0, Thread ffffa38df4205080, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833af650 nt!IovAllocateIrp+0x30
fffff80082d9e567 nt!IopAllocateIrpExReturn+0x151c67
fffff800830e0b1c nt!IopXxxControlFile+0x3ec
fffff800830e0716 nt!NtDeviceIoControlFile+0x56
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13
Parsing entry 0x3fff/0x4000

2: kd> dt nt!_IRP fffffd80d5412ea0
+0x000 Type : 0n6
+0x002 Size : 0x160
+0x004 AllocationProcessorNumber : 0
+0x006 Reserved : 0
+0x008 MdlAddress : (null)
+0x010 Flags : 0x40060070
+0x018 AssociatedIrp :
+0x020 ThreadListEntry : _LIST_ENTRY [0xfffffd80d272cec0 - 0xffffa38df4b0cd70]
+0x030 IoStatus : _IO_STATUS_BLOCK
+0x040 RequestorMode : 1 ‘’
+0x041 PendingReturned : 0 ‘’
+0x042 StackCount : 1 ‘’
+0x043 CurrentLocation : 1 ‘’
+0x044 Cancel : 0x1 ‘’
+0x045 CancelIrql : 0x1 ‘’
+0x046 ApcEnvironment : 0 ‘’
+0x047 AllocationFlags : 0x1 ‘’
+0x048 UserIosb : 0x000001af04a7b468 _IO_STATUS_BLOCK
+0x050 UserEvent : (null)
+0x058 Overlay : <unnamed-tag>
+0x068 CancelRoutine : (null)
+0x070 UserBuffer : 0x00000071fd8ff648 Void
+0x078 Tail :

2: kd> !irp fffffd80d5412ea0
Irp is active with 1 stacks 1 is current (= 0xfffffd80d5412f70)
No Mdl: System buffer=fffffd80d511cff0: Thread ffffa38df4b0c700: Irp stack trace.
cmd flg cl Device File Completion-Context
>[IRP_MJ_DEVICE_CONTROL(e), N/A(0)]
0 1 ffffa38df1794460 ffffa38df4253790 00000000-00000000 pending
\Driver\ebUniversalProForEthernet
Args: 00000004 00000010 0x172334 00000000

Irp Extension present at 0xfffffd80d5412fb8:

By the way, I’m now sitting here with my mind completely blown at the depth to which one can drill down with windbg and verifier.

I REALLY want to take that advanced debugging course now. I sure hope you guys get enough people to run it.

It is highly likely that code not using cancel safe queues is implemented
incorrectly.

Mark Roddy

On Tue, Apr 24, 2018 at 10:47 AM, xxxxx@pleora.com <
xxxxx@lists.osr.com> wrote:

> Is your code using the MSFT cancel safe queue mechanism or is it so old
> that it rolled its own cancel processing?

Definitely doing its own cancel processing. One strange thing that I
don’t understand yet is why the cancel handler never gets set for the IRPs
of this particular filter.

The driver is architected like 3 filters in one. A receiver, transmitter
and pcap recorder. They seem to follow slightly different IRP handling
paths though so maybe I need to try to reconcile what they’re doing.



From what I’ve been reading, I tend to agree. This is a very old driver and it’s not using WDF at all, unfortunately. To add insult to injury, it’s ALSO a “cross-platform” driver in that most of the back end data processing code runs on Linux, Windows and OSX using abstractions for the OS-specific primitives and structures. You can imagine there are some subtleties lost when the only lock being used is a spinlock. I recently fixed some IRQL_NOT_LESS_OR_EQUAL issues where a mutex was needed instead.

In any event, rewriting it for WDF isn’t in the cards, so I’m stuck with it.

It is highly likely that code not using cancel safe queues is implemented
incorrectly.

A few things seem odd to me.

IRQL is 1. Relevant?

Our driver doesn’t seem to be in the stack of the last request. I guess that means we either never completed the request or never even got it at all. I’m assuming the IRP gets handed to the driver pretty much instantly after it’s allocated, though.

The IRP itself has a NULL CancelRoutine, so now I REALLY don’t understand why the IP is jumping off into a random memory location, based on my reading of https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/nf-wdm-iocancelirp

Cancel safe queues are implemented in WDM. See:
https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/cancel-safe-irp-queues

Mark Roddy

On Tue, Apr 24, 2018 at 12:37 PM, xxxxx@pleora.com <
xxxxx@lists.osr.com> wrote:

From what I’ve been reading, I tend to agree. This is a very old driver
and it’s not using WDF at all, unfortunately. To add insult to injury,
it’s ALSO a “cross-platform” driver in that most of the back end data
processing code runs on Linux, Windows and OSX using abstractions for the
OS-specific primitives and structures. You can imagine there are some
subtleties lost when the only lock being used is a spinlock. I recently
fixed some IRQL_NOT_LESS_OR_EQUAL issues where a mutex was needed instead.

In any event, rewriting it for WDF isn’t in the cards, so I’m stuck with it.

> It is highly likely that code not using cancel safe queues is implemented
> incorrectly.



Ah! Very interesting. I still don’t fully understand the code but I’m not convinced we’re queuing IRPs at all. It looks like we just keep it until we’re done with it and don’t expect any more to come in afterwards. I’m beginning to think this driver is not exceptionally well implemented.

You’ve given me some really good reading material. Thanks.

Cancel safe queues are implemented in WDM. See:
https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/cancel-safe-irp-queues

The Irp CancelRoutine field is set to null before your cancel routine is
called.

Mark Roddy

On Tue, Apr 24, 2018 at 1:36 PM, xxxxx@pleora.com <
xxxxx@lists.osr.com> wrote:

A few things seem odd to me.

IRQL is 1. Relevant?

The fact that our driver doesn’t seem to be in the stack of the last
request. I guess that means that either we never completed the request or
never even got it at all. I’m assuming the IRP gets handed to the driver
pretty much instantly after it’s allocated though.

The IRP itself has a NULL CancelRoutine, so now I REALLY don’t understand
why the IP is jumping off into a random memory location, based on my reading
of https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/nf-wdm-iocancelirp



Oh! So it could very well have had an invalid routine set. All of the places where it’s set in our code seem to be setting it to NULL from what I can see, but maybe it’s using some uninitialized memory. I’ll need to look more closely at the code around that region.

The Irp CancelRoutine field is set to null before your cancel routine is
called.

Well son of a gun! I think I found it. We have a cancel routine that works for both the receiver and pcap recorder but for some reason it was never getting set for the pcap filter IRPs. I went and hard-coded it and boom! No more crash.

I’ll need to formalize the fix to see why the original attempt to set the CancelRoutine wasn’t working but this is definitely the right path.

Thank you so much for all your help everyone.