Thanks very much for all your replies. After two weeks of desperate debugging, I unfortunately still cannot fix this problem, but I want to give an update here. We have opened a ticket with the Microsoft WDK group and provided two dump files to the Microsoft engineers. However, no solution has been found yet.
I really do not think our DPC function is the root cause. The same driver code has been used for almost 10 years, from Win7 through Win10 (versions 1607, 1703, 1709 and 1803), with no problems. Only on the latest Win10 version, 1809, does the BSOD happen consistently (with the same hardware).
My debugging shows the root cause is the “PutScatterGatherList()” call inside our DPC. “PutScatterGatherList()” never finishes and runs forever, which I do not understand.
To prove this, I ran another test in which I disabled all interrupts in our driver to keep the DPC function from running. Since “PutScatterGatherList()” is also used in our “CancelActiveRead()” function, I can produce another BSOD just by canceling the DMA read, without the DPC running.
Our “CancelActiveRead()” function has code like this:
VOID CancelActiveRead(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    //-----------------------------------------------------------------------------
    PGEMINI_DEVICE_EXT DevExt = (PGEMINI_DEVICE_EXT)DeviceObject->DeviceExtension;
    KIRQL OldIrql;
    //-----------------------------------------------------------------------------
    DbgPrint("\nCancelActiveRead() enter()\n");
    ...... // Some code here
    //-------------------------------------------------------------------------
    KeAcquireSpinLock(&DevExt->ImageReadStateLock, &OldIrql);
    DbgPrint("\nCancelActiveRead() get ImageReadStateLock\n");
    //
    // This drives everything practically into reset
    //
    SetPCICardReset(DevExt);
    //
    // Clear the Transfer length
    //
    ProgramDmaTransferLength(DevExt, 0);
    //
    // Undo some of the resets
    //
    ClearPCICardReset(DevExt);
    //
    // Disarm the Card! - yes we are doing this underneath the
    // app sort of, except that either the app told us to CancelIo,
    // or Windows is killing the thread...
    //
    ClearControlRegisterBits(DevExt, CTRL_BIT_ARM);
    SetControlRegisterBits(DevExt, CTRL_BIT_FLUSH_INPUT);
    //
    // If these are equal, the DPC hasn't seen it yet
    //
    if (DevExt->CurrentReadIrp == Irp)
        DevExt->CurrentReadIrp = NULL;
    //
    // Return the scatter/gather list we've been using
    //
    // Third parameter is FALSE because this is a Read
    // operation
    //
    DbgPrint("\nCancelActiveRead() Call PutScatterGatherList()\n");
    DevExt->ReadAdapter->DmaOperations->PutScatterGatherList(
        DevExt->ReadAdapter,
        DevExt->ReadScatterGatherList,
        FALSE);
    DbgPrint("\nCancelActiveRead() end PutScatterGatherList()\n");
    //
    // Make sure we don't have a stale pointer
    //
    DevExt->ReadScatterGatherList = NULL;
    KeReleaseSpinLock(&DevExt->ImageReadStateLock, OldIrql);
    DbgPrint("\nCancelActiveRead() release ImageReadStateLock\n");
    //-------------------------------------------------------------------------
    ....... // other code here
    .......
}
From WinDbg, I got:
"CancelActiveRead() enter()
CancelActiveRead() get ImageReadStateLock
CancelActiveRead() Call PutScatterGatherList()
"
There is NO output after this.
I added two DbgPrint calls, one just before and one just after “PutScatterGatherList()”. Only the first one is displayed; the second never shows up, and the target machine freezes.
In our driver, “PutScatterGatherList()” is only used in “CancelActiveRead()” and in the DPC function. Therefore I think that “PutScatterGatherList()” running forever is the root problem that causes the BSOD: because it never finishes, it eventually triggers the DPC_WATCHDOG_VIOLATION.
I just do not understand what keeps “PutScatterGatherList()” from finishing, and I still have no clue how to fix it.
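The posted CancelActiveRead() calls “PutScatterGatherList()” unconditionally, and the DPC is the only other caller. One thing that may be worth ruling out is the same list being returned by both paths. This is only an assumption; nothing in the trace above confirms it. Below is a minimal user-mode sketch of a guard in which whichever path runs first claims the pointer and the other path finds NULL. Every Model* name is a hypothetical stand-in and not a WDK definition; in the real driver the claim step would run while holding ImageReadStateLock.

```c
/* User-mode model only: the types and helpers below are hypothetical
 * stand-ins for the driver's structures, not WDK definitions. */
#include <assert.h>
#include <stddef.h>

typedef struct ModelSgList { int Unused; } ModelSgList;

typedef struct ModelDeviceExt {
    ModelSgList *ReadScatterGatherList; /* NULL once the list has been returned */
    int PutCalls;                       /* counts calls to the stand-in below */
} ModelDeviceExt;

/* Stand-in for DmaOperations->PutScatterGatherList(). */
static void ModelPutScatterGatherList(ModelDeviceExt *DevExt, ModelSgList *List)
{
    assert(List != NULL);
    DevExt->PutCalls++;
}

/* Take ownership of the list and clear the shared pointer in one step.
 * In a driver this would run while holding ImageReadStateLock. */
static ModelSgList *ModelClaimReadSgList(ModelDeviceExt *DevExt)
{
    ModelSgList *List = DevExt->ReadScatterGatherList;
    DevExt->ReadScatterGatherList = NULL;
    return List;
}

/* Shared by the modelled cancel path and the modelled DPC path. */
static void ModelReleaseReadSgList(ModelDeviceExt *DevExt)
{
    ModelSgList *List = ModelClaimReadSgList(DevExt);
    if (List != NULL) {
        ModelPutScatterGatherList(DevExt, List);
    }
}

/* Runs both paths back to back and returns how many times the list
 * was handed to the stand-in put function. */
static int ModelCancelThenDpc(void)
{
    ModelSgList List = { 0 };
    ModelDeviceExt DevExt = { &List, 0 };

    ModelReleaseReadSgList(&DevExt); /* cancel path claims and returns the list */
    ModelReleaseReadSgList(&DevExt); /* DPC path finds NULL and does nothing */
    return DevExt.PutCalls;
}
```

If a NULL check like this in the real cancel routine or DPC ever skips the call, that would indicate the two paths are racing over the same list. If it never does, this explanation can be dropped.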