Endless high-cpu usage on latest 10x86 builds calling CancelIO during thread stop and DriverVerifier

OriSei · September 7, 2021, 7:55am

In our CI environments, we’ve detected that many tests on latest Windows 10 x86 releases are stuck when we stop our service that communicates with our driver. This issue only happens with driver verifier enabled using this configuration targeting our driver:

Verifier Flags: 0x03ab8dbb

  Standard Flags:

    [X] 0x00000001 Special pool.
    [X] 0x00000002 Force IRQL checking.
    [X] 0x00000008 Pool tracking.
    [X] 0x00000010 I/O verification.
    [X] 0x00000020 Deadlock detection.
    [X] 0x00000080 DMA checking.
    [X] 0x00000100 Security checks.
    [X] 0x00000800 Miscellaneous checks.
    [X] 0x00020000 DDI compliance checking.

  Additional Flags:

    [ ] 0x00000004 Randomized low resources simulation.
    [ ] 0x00000200 Force pending I/O requests.
    [X] 0x00000400 IRP logging.
    [ ] 0x00002000 Invariant MDL checking for stack.
    [ ] 0x00004000 Invariant MDL checking for driver.
    [X] 0x00008000 Power framework delay fuzzing.
    [X] 0x00010000 Port/miniport interface checking.
    [ ] 0x00040000 Systematic low resources simulation.
    [X] 0x00080000 DDI compliance checking (additional).
    [X] 0x00200000 NDIS/WIFI verification.
    [X] 0x00800000 Kernel synchronization delay fuzzing.
    [X] 0x01000000 VM switch verification.
    [X] 0x02000000 Code integrity checks.

Things to keep in mind:

This issue has not been detected in Windows 10 x64 machines with the same driver verifier settings.
This issue has not been reproduced on any machine having driver verifier disabled.
The same 32-bit driver binary is tested on all x86 machines, and, likewise, the same 64-bit driver binary is tested on all x64 machines.
On these OSs the issue is reproduced with driver verifier (reproduction is 100% consistent): 10x86-20H1, 10x86-20H2, 10x86-21H1.
On these OSs the issue has not been reproduced with driver verifier: 08x64, 08x86, WS-2016, WS-2019, 10x64-1507, 10x86-1507, 10x64-1809, 10x86-1809, 10x64-20H1, 10x64-20H2, 10x64-21H1.
AppVerifier appears in the stacks below, but the issue is still reproduced without it.

Scenario:

When our service starts, it creates several threads to communicate with our driver. On each thread, 4 IRPs are sent to the driver to retrieve data asynchronously. When the service stops, we cancel any pending IO on that thread. All those IRPs are handled in the driver with a cancel safe queue.

When the issue happens, we saw that the system calls our registered csq cancel routine for the first IRP, as seen in this stack (in windbg we placed a bp on nxtrdrv!NxtCsqCompleteCanceledIrp):

    nxtrdrv!NxtCsqCompleteCanceledIrp
    nt!IopCsqCancelRoutine+0x82 (FPO: [Non-Fpo])
    VerifierExt!xdv_DriverCancel_wrapper+0xef (FPO: [Non-Fpo])
    nt!IovpCancelRoutine+0x3f (FPO: [Non-Fpo])
    nt!IoCancelIrp+0x16f4cc
    nt!NtCancelIoFile+0x8d (FPO: [Non-Fpo])
    nt!KiSystemServicePostCall (FPO: [0,3] TrapFrame @ b7157c14)
    ntdll!KiFastSystemCallRet (FPO: [0,0,0])
    ntdll!NtCancelIoFile+0xa (FPO: [2,0,0])
    vfbasics!AVrfpNtCancelIoFile+0x29 (FPO: [Non-Fpo])
    KERNELBASE!CancelIo+0x14 (FPO: [Non-Fpo])
    nxtsvc!CWebAppMonitor::CancelAllIoctls+0x76 (Inline Function @ 00c743b3) (CONV: thiscall)
    nxtsvc!CWebAppMonitor::MonitorThread+0x253 (FPO: [0,3,0]) (CONV: thiscall)
    nxtsvc!WebAppThread+0x9 (FPO: [1,0,0]) (CONV: stdcall)
    vfbasics!AVrfpStandardThreadFunction+0x48 (FPO: [Non-Fpo])
    KERNEL32!BaseThreadInitThunk+0x19 (FPO: [Non-Fpo])
    ntdll!__RtlUserThreadStart+0x2b (FPO: [Non-Fpo])
    ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

And our implementation is essentially a copy of what can be found in MSDN documentation:

    VOID 
    NxtCsqCompleteCanceledIrp(
        PIO_CSQ pCsq,
        PIRP    pIrp
       )
    {
        UNREFERENCED_PARAMETER(pCsq);

        pIrp->IoStatus.Status = STATUS_CANCELLED;
        pIrp->IoStatus.Information = 0;
        IoCompleteRequest(pIrp, IO_NO_INCREMENT);
    }

If we continue the debugging session, the bp won’t be triggered again for any of the pending IRPs on that thread. In fact, we paused windbg the after 5min, and the new stack is:

    nt!RtlpBreakWithStatusInstruction (FPO: [1,0,0])
    nt!KdCheckForDebugBreak+0x96197
    nt!KiUpdateRunTime+0x3f (FPO: [Non-Fpo])
    nt!KiUpdateTime+0x1e5 (FPO: [Non-Fpo])
    nt!KeClockInterruptNotify+0x16a (FPO: [Non-Fpo])
    hal!HalpTimerClockInterrupt+0xf6 (FPO: [Non-Fpo])
    nt!KiCallInterruptServiceRoutine+0x239 (FPO: [Non-Fpo])
    nt!KiInterruptDispatchNoLockNoEtwNoStack+0x2b (FPO: [Non-Fpo])
    nt!KiUnexpectedInterruptTail+0x38d (FPO: [0,2] TrapFrame @ b7157b20)
    nt!KxWaitForLockChainValid+0x17 (FPO: [0,0,0])
    nt!IopCsqCancelRoutine+0xad (FPO: [Non-Fpo])
    WARNING: Frame IP not in any known module. Following frames may be wrong.
    0xb89d7f30
    nt!NtCancelIoFile+0x8d (FPO: [Non-Fpo])
    nt!KiSystemServicePostCall (FPO: [0,3] TrapFrame @ b7157c14)
    ntdll!KiFastSystemCallRet (FPO: [0,0,0])
    ntdll!NtCancelIoFile+0xa (FPO: [2,0,0])
    vfbasics!AVrfpNtCancelIoFile+0x29 (FPO: [Non-Fpo])
    KERNELBASE!CancelIo+0x14 (FPO: [Non-Fpo])
    nxtsvc!CWebAppMonitor::CancelAllIoctls+0x76 (Inline Function @ 00c743b3)
    nxtsvc!CWebAppMonitor::MonitorThread+0x253 (FPO: [0,3,0]) (CONV: thiscall)
    nxtsvc!WebAppThread+0x9 (FPO: [1,0,0]) (CONV: stdcall)
    vfbasics!AVrfpStandardThreadFunction+0x48 (FPO: [Non-Fpo])
    KERNEL32!BaseThreadInitThunk+0x19 (FPO: [Non-Fpo])
    ntdll!__RtlUserThreadStart+0x2b (FPO: [Non-Fpo])
    ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

Given that we are still inside nt!IopCsqCancelRoutine apparently the execution continued, but then it got stuck inside KxWaitForLockChainValid. At this point the thread still has the 4 IRPs: the first one marked as completed and the rest pending.

Has anyone experienced an issue like this one?

Best regards,
Rodrigo Alonso.